Foreword


The publication of this book is perfectly timed. Existing books on deep learning
either focus on theoretical aspects or are largely manuals for tools. But this book 
presents an unprecedented analysis and comparison of deep learning techniques for 
natural language and speech processing, closing the substantial gap between the- 
ory and practice. Each chapter discusses the theory underpinning the topics, and an 
exceptional collection of 13 case studies in different application areas is presented. 
They include classification via distributed representation, summarization, machine 
translation, sentiment analysis, transfer learning, multitask NLP, end-to-end speech, 
and question answering. Each case study includes the implementation and compar- 
ison of state-of-the-art techniques, and the accompanying website provides source 
code and data. This is extraordinarily valuable for practitioners, who can experiment 
firsthand with the methods and can deepen their understanding of the methods by 
applying them to real-world scenarios. 

This book offers comprehensive coverage of deep learning, from its foundations
to advanced and recent topics, including word embedding, convolutional neural net- 
works, recurrent neural networks, attention mechanisms, memory-augmented net- 
works, multitask learning, domain adaptation, and reinforcement learning. The book 
is a great resource for practitioners and researchers both in industry and academia, 
and the discussed case studies and associated material can serve as inspiration for a 
variety of projects and hands-on assignments in a classroom setting. 


Associate Professor at GMU Carlotta Domeniconi, PhD 
Fairfax, VA, USA 
February 2019 


Natural language and speech processing applications such as virtual assistants and 
smart speakers play an important and ever-growing role in our lives. At the same 
time, amid an increasing number of publications, it is becoming harder to iden- 
tify the most promising approaches. As the Chief Analytics Officer at Digital Rea- 
soning and with a PhD in Big Data Machine Learning, Uday has access to both 
the practical and research aspects of this rapidly growing field. Having authored 
Mastering Java Machine Learning, he is uniquely suited to break down both practical
and cutting-edge approaches. This book combines both theoretical and practical
aspects of machine learning in a rare blend. It consists of an introduction that makes 
it accessible to people starting in the field, an overview of state-of-the-art methods 
that should be interesting even to people working in research, and a selection of 
hands-on examples that ground the material in real-world applications and demon- 
strate its usefulness to industry practitioners. 


Research Scientist at DeepMind Sebastian Ruder, PhD 
London, UK 
February 2019 


A few years ago, I picked up a few textbooks to study topics related to artificial
intelligence, such as natural language processing and computer vision. My
memory of reading these textbooks largely consists of staring helplessly out of
the window. Whenever I attempted to implement the described concepts and math, 
I wouldn’t know where to start. This is fairly common in books written for aca- 
demic purposes; they mockingly leave the actual implementation “as an exercise to 
the reader.” There are a few exceptional books that try to bridge this gap, written 
by people who know the importance of going beyond the math all the way to a 
working system. This book is one of those exceptions: with its discussions, case
studies, code snippets, and comprehensive references, it delightfully bridges the gap 
between learning and doing. 

I especially like the use of Python and open-source tools out there. It’s an opin- 
ionated take on implementing machine learning systems—one might ask the fol- 
lowing question: “Why not X,” where X could be Java, C++, or Matlab? However, 
I find solace in the fact that it’s the most popular opinion, which gives the read- 
ers an immense support structure as they implement their own ideas. In the mod- 
ern Internet-connected world, joining a popular ecosystem is equivalent to having 
thousands of humans connecting together to help each other—from Stack Overflow 
posts solving an error message to GitHub repositories implementing high-quality 
systems. To give you perspective, I’ve seen the other side, supporting a niche com- 
munity of enthusiasts in machine learning using the programming language Lua for 
several years. It was a daily struggle to do new things—even basic things such as 
making a bar chart—precisely because our community of people was a few orders 
of magnitude smaller than Python’s. 

Overall, I hope the reader enjoys a modern, practical take on deep learning sys- 
tems, leveraging open-source machine learning systems heavily, and being taught 
a lot of “tricks of the trade” by the incredibly talented authors, one of whom I’ve 
known for years and have seen build robust speech recognition systems. 


Research Engineer at Facebook AI Research (FAIR) Soumith Chintala, PhD 
New York, NY, USA 
February 2019 


Preface 


Why This Book? 


With the widespread adoption of deep learning, natural language processing (NLP), 
and speech applications in various domains such as finance, healthcare, and gov- 
ernment and across our daily lives, there is a growing need for one comprehensive 
resource that maps deep learning techniques to NLP and speech and provides in- 
sights into using the tools and libraries for real-world applications. Many books 
focus on deep learning theory or deep learning for NLP-specific tasks, while others
are cookbooks for tools and libraries. But the constant flux of new algorithms,
tools, frameworks, and libraries in a rapidly evolving landscape means that there are 
few available texts that contain explanations of the recent deep learning methods 
and state-of-the-art approaches applicable to NLP and speech, as well as real-world 
case studies with code to provide hands-on experience. As an example, you would 
find it difficult to find a single source that explains the impact of neural attention 
techniques applied to a real-world NLP task such as machine translation across a 
range of approaches, from the basic to the state-of-the-art. Likewise, it would be 
difficult to find a source that includes accompanying code based on well-known li- 
braries with comparisons and analysis across these techniques. 

This book provides the following all in one place: 


• A comprehensive resource that builds up from elementary deep learning, text,
and speech principles to advanced state-of-the-art neural architectures

• A ready reference for deep learning techniques applicable to common NLP and
speech recognition applications

• A useful resource on successful architectures and algorithms with essential
mathematical insights explained in detail

• An in-depth reference and comparison of the latest end-to-end neural speech
processing approaches

• A panoramic resource on leading-edge transfer learning, domain adaptation, and
deep reinforcement learning architectures for text and speech

• Practical aspects of using these techniques with tips and tricks essential for
real-world applications

• A hands-on approach in using Python-based libraries such as Keras, TensorFlow,
and PyTorch to apply these techniques in the context of real-world case studies


In short, the primary purpose of this book is to provide a single source that addresses 
the gap between theory and practice using case studies with code, experiments, and 
supporting analysis. 


Who Is This Book for? 


This book is intended to introduce the foundations of deep learning, natural lan- 
guage processing, and speech, with an emphasis on application and practical expe- 
rience. It is aimed at NLP practitioners, graduate students in Engineering and Com- 
puter Science, advanced undergraduates, and anyone with the appropriate mathe- 
matical background who is interested in an in-depth introduction to the recent deep 
learning approaches in NLP and speech. We expect the reader to be familiar with
multivariate calculus, probability, linear algebra, and Python programming.

Python is becoming the lingua franca of data scientists and researchers for per- 
forming experiments in deep learning. There are many libraries with Python-enabled 
bindings for deep learning, NLP, and speech that have sprung up in the last few 
years. Therefore, we use both the Python language and its accompanying libraries 
for all case studies in this book. As it is unfeasible to fully cover every topic in a 
single book, we present what we believe are the key concepts with regard to NLP 
and speech that will translate into application. In particular, we focus on the inter- 
section of those areas, wherein we can leverage different frameworks and libraries 
to explore modern research and related applications. 


What Does This Book Cover? 


The book is organized into three parts, aligning to different groups of readers and 
their expertise. The three parts are: 


• Machine Learning, NLP, and Speech Introduction. The first part has three
chapters that introduce readers to the fields of NLP, speech recognition, deep
learning, and machine learning with basic hands-on case studies using
Python-based tools and libraries.

• Deep Learning Basics. The five chapters in the second part introduce deep
learning and various topics that are crucial for speech and text processing, including
word embeddings, convolutional neural networks, recurrent neural networks, and
speech recognition basics.

• Advanced Deep Learning Techniques for Text and Speech. The third part
has five chapters that discuss the latest research in the areas of deep learning that
intersect with NLP and speech. Topics including attention mechanisms,
memory-augmented networks, transfer learning, multitask learning, domain adaptation,
reinforcement learning, and end-to-end deep learning for speech recognition are
covered using case studies.


Next, we summarize the topics covered in each chapter. 


• In the Introduction, we introduce the readers to the fields of deep learning, NLP,
and speech with a brief history. We present the different areas of machine learn- 
ing and detail different resources ranging from books to datasets to aid readers in 
their practical journey. 


• The Basics of Machine Learning chapter provides a refresher of basic theory
and important practical concepts. Topics covered include the learning process, 
supervised learning, data sampling, validation techniques, overfitting and under- 
fitting of the models, linear and nonlinear machine learning algorithms, and se- 
quence data modeling. The chapter ends with a detailed case study using struc- 
tured data to build predictive models and analyze results using Python tools and 
libraries. 


• In the Text and Speech Basics chapter, we introduce the fundamentals of computational
linguistics and NLP to the reader, including lexical, syntactic, semantic,
and discourse representations. We introduce language modeling and discuss
applications such as text classification, clustering, machine translation, question 
answering, automatic summarization, and automated speech recognition, con- 
cluding with a case study on text clustering and topic modeling. 


• The Basics of Deep Learning chapter builds upon the machine learning foundation
by introducing deep learning. The chapter begins with a fundamental analysis
of the components of deep learning in the multilayer perceptron (MLP),
followed by variations on the basic MLP architecture and techniques for training 
deep neural networks. As the chapter progresses, it introduces various architec- 
tures for both supervised and unsupervised learning, such as multiclass MLPs, 
autoencoders, and generative adversarial networks (GANs). Finally, the mate- 
rial is combined into the case study, analyzing both supervised and unsupervised 
neural network architectures on a spoken digit dataset. 


• For the Distributed Representations chapter, we investigate distributional
semantics and word representations based on vector space models such as 
word2vec and GloVe. We detail the limitations of word embeddings including 
antonymy and polysemy and the approaches that can overcome them. We also 
investigate extensions of embedding models, including subword, sentence, concept,
Gaussian, and hyperbolic embeddings. We finish the chapter with a case
study that dives into how embedding models are trained and their applicability 
to document clustering and word sense disambiguation. 


• The Convolutional Neural Networks chapter walks through the basics of convolutional
neural networks and their applications to NLP. The main strand of
discourse in the chapter introduces the topic by starting from fundamental math- 
ematical operations that form the building blocks, explores the architecture in 
increasing detail, and ultimately lays bare the exact mapping of convolutional 
neural networks to text data in its various forms. Several topics such as clas- 
sic frameworks from the past, their modern adaptations, applications to different 
NLP tasks, and some fast algorithms are also discussed in the chapter. The chap- 
ter ends with a detailed case study using sentiment classification that explores 
most of the algorithms mentioned in the chapter with practical insights. 


• The Recurrent Neural Networks chapter presents recurrent neural networks
(RNNs), allowing the incorporation of sequence-based information into deep 
learning. The chapter begins with an in-depth analysis of the recurrent connec- 
tions in deep learning and their limitations. Next, we describe basic approaches 
and advanced techniques to improve performance and quality in recurrent models.
We then look at applications of these architectures in NLP and speech.
Finally, we conclude with a case study applying and comparing
RNN-based architectures on a neural machine translation task, analyzing
the effects of the network types (RNN, GRU, LSTM, and Transformer) and con- 
figurations (bidirectional, number of layers, and learning rate). 


• The Automatic Speech Recognition chapter describes the fundamental approaches
to automatic speech recognition (ASR). The beginning of the chapter
focuses on the metrics and features commonly used to train and validate
ASR systems. We then move toward describing the statistical approach to speech 
recognition, including the base components of an acoustic, lexicon, and language 
model. The case study focuses on training and comparing two common ASR 
frameworks, CMUSphinx and Kaldi, on a medium-sized English transcription 
dataset. 


• The Attention and Memory-Augmented Networks chapter introduces the
reader to the attention mechanisms that have played a significant role in neural 
techniques in the last few years. Next, we introduce the related topic of memory- 
augmented networks. We discuss most of the neural-based memory networks, 
ranging from memory networks to the recurrent entity networks in enough detail 
for the user to understand the working of each technique. This chapter is unique 
as it has two case studies, the first one for exploring the attention mechanism 
and the second for memory networks. The first case study extends the machine 
translation case study started in Chap. 7 to examine the impact of different attention
mechanisms discussed in this chapter. The second case study explores and
analyzes different memory networks on the question-answering NLP task. 


• The Transfer Learning: Scenarios, Self-Taught Learning, and Multitask
Learning chapter introduces the concept of transfer learning and covers multitask
learning techniques extensively. The case study explores multitask learning
techniques for NLP tasks such as part-of-speech tagging, chunking, and named 
entity recognition and analysis. Readers should expect to gain insights into real, 
practical aspects of applying the multitask learning techniques introduced here. 


• The Transfer Learning: Domain Adaptation chapter probes into the area of
transfer learning where the models are subjected to constraints such as having
less data to train on, or situations in which the data to predict on differs
from the data the model was trained on. Techniques for domain adaptation, few-shot learning,
one-shot learning, and zero-shot learning are covered in this chapter. A detailed 
case study is presented using Amazon product reviews across different domains 
where many of the techniques discussed are applied. 


• The End-to-End Speech Recognition chapter combines the ASR concepts in
Chap. 8 with the deep learning techniques for end-to-end recognition. This chap- 
ter introduces mechanisms for training end-to-end sequence-based architectures 
with CTC and attention, as well as explores decoding techniques to improve 
quality further. The case study extends the one presented in Chap. 8 by using the 
same dataset to compare two end-to-end techniques, Deep Speech 2 and ESPnet 
(CTC-Attention hybrid training). 


• In the Deep Reinforcement for Text and Speech chapter, we review the fundamentals
of reinforcement learning and discuss their adaptation to deep
sequence-to-sequence models, including deep policy gradient, deep Q-learning, double
DQN, and DAAC algorithms. We investigate deep reinforcement learning ap- 
proaches to NLP tasks including information extraction, text summarization, ma- 
chine translation, and automatic speech recognition. We conclude with a case 
study on the application of deep policy gradient and deep Q-learning algorithms 
to text summarization. 
Notation


Calculus

$\approx$                          Approximately equal to
$|A|_{L_1}$                        $L_1$ norm of matrix $A$
$|A|_{L_2}$                        $L_2$ norm of matrix $A$
$\frac{da}{db}$                    Derivative of $a$ with respect to $b$
$\frac{\partial a}{\partial b}$    Partial derivative of $a$ with respect to $b$
$\nabla_x Y$                       Gradient of $Y$ with respect to $x$
$\nabla_X Y$                       Matrix of derivatives of $Y$ with respect to $X$


Datasets

$D$                                Dataset, a set of examples and corresponding targets,
                                   $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
$\mathcal{X}$                      Space of all possible inputs
$\mathcal{Y}$                      Space of all possible outputs
$y_i$                              Target label for example $i$
$\hat{y}_i$                        Predicted label for example $i$
$\mathcal{L}$                      Log-likelihood loss
$\Omega$                           Learned parameters


Functions

$f: A \to B$                       A function $f$ that maps a value in the set $A$ to set $B$
$f(x; \theta)$                     A function of $x$ parameterized by $\theta$. This is frequently
                                   reduced to $f(x)$ for notational clarity.
$\log x$                           Natural log of $x$
$\sigma(a)$                        Logistic sigmoid, $\frac{1}{1 + e^{-a}}$
$[\![a \neq b]\!]$                 A function that yields 1 if the condition contained is true,
                                   otherwise it yields 0
$\arg\min_x f(x)$                  Set of arguments that minimize $f(x)$,
                                   $\arg\min_x f(x) = \{x \mid f(x) = \min_{x'} f(x')\}$
$\arg\max_x f(x)$                  Set of arguments that maximize $f(x)$,
                                   $\arg\max_x f(x) = \{x \mid f(x) = \max_{x'} f(x')\}$
Linear Algebra

$a$                                Scalar value (integer or real)
$(a_1, \ldots, a_n)$               Vector containing elements $a_1$ to $a_n$
$A_{m \times n}$                   A matrix with $m$ rows and $n$ columns
$A_{ij}$                           Value of matrix $A$ at row $i$ and column $j$
$\mathbf{a}$                       Vector (dimensions implied by context)
$\mathbf{A}$                       Matrix (dimensions implied by context)
$\mathbf{A}^{\top}$                Transpose of matrix $A$
$\mathbf{A}^{-1}$                  Inverse of matrix $A$
$\mathbf{I}$                       Identity matrix (dimensionality implied by context)
$\mathbf{A} \cdot \mathbf{B}$      Dot product of matrices $A$ and $B$
$\mathbf{A} \times \mathbf{B}$     Cross product of matrices $A$ and $B$
$\mathbf{A} \circ \mathbf{B}$      Element-wise (Hadamard) product
$\mathbf{A} \otimes \mathbf{B}$    Kronecker product of matrices $A$ and $B$
$\mathbf{a}; \mathbf{b}$           Concatenation of vectors $a$ and $b$


$\mathbb{E}$                       Expected value
$P(A)$                             Probability of event $A$
$P(A \mid B)$                      Probability of event $A$ given event $B$
$X \sim \mathcal{N}(\mu, \sigma^2)$  Random variable $X$ sampled from a Gaussian (Normal) distribution
                                   with mean $\mu$ and variance $\sigma^2$


$\mathcal{A}$                      A set
$\mathbb{R}$                       Set of real numbers
$\mathbb{C}$                       Set of complex numbers
$\emptyset$                        Empty set
$\{a, b\}$                         Set containing the elements $a$ and $b$
$\{1, 2, \ldots, n\}$              Set containing all integers from 1 to $n$
$\{a_1, a_2, \ldots, a_n\}$        Set containing $n$ elements
$a \in A$                          Value $a$ is a member of the set $A$
$[a, b]$                           Set of real values from $a$ to $b$, including $a$ and $b$
$[a, b)$                           Set of real values from $a$ to $b$, including $a$ but excluding $b$
$a_{1:m}$                          Set of elements $\{a_1, a_2, \ldots, a_m\}$ (used for notational convenience)


Unless otherwise specified, the chapters assume the notation given above.


Part I 
Machine Learning, NLP, and Speech 
Introduction 


Chapter 1
Introduction


In recent years, advances in machine learning have led to significant and widespread
improvements in how we interact with our world. One of the most consequential of
these advances is the field of deep learning. Based on artificial neural networks that
are loosely inspired by those in the human brain, deep learning is a set of methods that
permits computers to learn useful representations directly from data, with far less
hand-crafted feature engineering than earlier approaches required. Furthermore,
these methods can adapt to changing environments and provide continuous
improvement to learned abilities. Today, deep learning is prevalent in our everyday
life in the form of Google’s search, Apple’s Siri, and Amazon’s and Netflix’s
recommendation engines, to name but a few examples. When we interact with our
email systems, online chatbots, and voice or image recognition systems deployed at
businesses ranging from healthcare to financial services, we see robust applications
of deep learning in action.

Human communication is at the core of developments in many of these areas, and 
the complexities of language make computational approaches increasingly difficult. 
With the advent of deep learning, however, the burden shifts from producing rule- 
based approaches to learning directly from the data. These deep learning techniques 
open new fronts in our ability to model human communication and interaction and 
improve human-computer interaction.

Deep learning saw explosive growth, attention, and availability of tools following 
its success in computer vision in the early 2010s. Natural language processing soon
reaped many of the same benefits. Speech recognition,
traditionally a field dominated by feature engineering and model tuning techniques, 
incorporated deep learning into its feature extraction methods resulting in strong 
gains in quality. Figure 1.1 shows the popularity of these fields in recent years. 

The age of big data is another contributing factor to the performance gains with 
deep learning. Unlike many traditional learning algorithms, deep learning models 
continue to improve with the amount of data provided, as illustrated in Fig. 1.2. 

Fig. 1.1: Google trends for deep learning, natural language processing, and speech
recognition in the last decade

Fig. 1.2: Deep learning benefits heavily from large datasets

Perhaps one of the largest contributors to the success of deep learning is the active
community that has developed around it. The overlap and collaboration between
academic institutions and industry in open source has led to a virtual cornucopia
of tools and libraries for deep learning. This overlap and influence of the academic
world and the consumer marketplace has also led to a shift in the popularity of
programming languages, as illustrated in Fig. 1.3, specifically toward Python.

Python has become the go-to language for many analytics applications, due to 
its simplicity, cleanliness of syntax, multiple data science libraries, and extensibility 
(specifically with C++). This simplicity and extensibility have led most top deep
learning frameworks to be built on Python or to adopt Python interfaces that wrap
high-performance C++ and GPU-optimized extensions. 

This book seeks to provide the reader an in-depth overview of deep learning techniques
in the fields of text and speech processing. Our hope is for the reader to walk
away with a thorough understanding of natural language processing and leading-edge
deep learning techniques that will provide a basis for all text and speech processing
advancements in the future. Since “practice makes for a wonderful companion,”
each chapter in this book is accompanied by a case study that walks through
a practical application of the methods introduced in the chapter.
Fig. 1.3: Google trends for programming languages such as Java, Python, and R
which are used in data science and deep learning in the last decade 


1.1 Machine Learning 


Machine learning is quickly becoming commonplace in many of the applications we
use daily. It can make us more productive, help us make decisions, provide a personalized
experience, and help us gain insights about the world by leveraging data. The field of
AI is broad, encompassing search algorithms, planning and scheduling, computer
vision, and many other areas. Machine learning, a subcategory of AI, is composed
of three areas: supervised learning, unsupervised learning, and reinforcement learning.
Deep learning is a collection of learning algorithms that has been applied to
each of these three areas, as shown in Fig. 1.4. Before we go further, we explain
how exactly deep learning applies. Each of these areas will be explored thoroughly
in the chapters of this book.


1.1.1 Supervised Learning 


Supervised learning relies on learning from a dataset with labels for each of the 
examples. For example, if we are trying to learn movie sentiment, the dataset may 
be a set of movie reviews and the labels are the 0-5 star rating. 

There are two types of supervised learning: classification and regression 
(Fig. 1.5). 

Classification maps an input into a fixed set of categories, for example, classify- 
ing an image as either a cat or dog. 

Regression problems, on the other hand, map an input to a real number value. 
An example of this is trying to predict the cost of your utility bill or the stock
market price. 

Fig. 1.4: The area of deep learning covers multiple areas of machine learning, while 
machine learning is a subset of the broader AI category 


(a)

Movie Review                                       Label
I really enjoyed it! I thought it was clever...    Positive
I have seen better films.                          Negative
I can’t believe I wasted my time on that...        Negative
Honestly, I was pleasantly surprised...            Positive

New example: It was spectacular!                   Positive


(b)

Movie Review                                       Score
I really enjoyed it! I thought it was clever...    5
I have seen better films.                          2
I can’t believe I wasted my time on that...        1
Honestly, I was pleasantly surprised...            4

New example: It was spectacular!                   1 < score < 5


Fig. 1.5: Supervised learning uses a labeled dataset to predict an output. In a clas- 
sification problem, (a) the output will be labeled as a category (e.g., positive or 
negative), while in a regression problem, (b) the output will be a value 
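
The difference between the two kinds of supervised learning can be made concrete with a minimal sketch built on the four reviews of Fig. 1.5. Everything beyond those reviews is an assumption made for illustration: the cue-word lists, the single hand-built feature, the midpoint threshold used for classification, and the one-variable least-squares line used for regression. Real systems learn their features from data rather than relying on hand-picked words.

```python
# Toy labeled dataset: (review text, category label, star score) from Fig. 1.5.
REVIEWS = [
    ("I really enjoyed it! I thought it was clever", "positive", 5),
    ("I have seen better films", "negative", 2),
    ("I can't believe I wasted my time on that", "negative", 1),
    ("Honestly, I was pleasantly surprised", "positive", 4),
]

# Hypothetical hand-picked cue words (an assumption for this sketch only).
POSITIVE_WORDS = {"enjoyed", "clever", "pleasantly", "surprised", "spectacular"}
NEGATIVE_WORDS = {"better", "wasted", "can't"}


def feature(text):
    """Map a review to one number: positive cue count minus negative cue count."""
    words = [w.strip("!.,").lower() for w in text.split()]
    return sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)


def fit_classifier(examples):
    """Classification: learn a threshold halfway between the two class means."""
    pos = [feature(t) for t, label, _ in examples if label == "positive"]
    neg = [feature(t) for t, label, _ in examples if label == "negative"]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def classify(text, threshold):
    """Output is one of a fixed set of categories."""
    return "positive" if feature(text) > threshold else "negative"


def fit_regressor(examples):
    """Regression: ordinary least squares for score = slope * feature + intercept."""
    xs = [feature(t) for t, _, _ in examples]
    ys = [score for _, _, score in examples]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def predict_score(text, slope, intercept):
    """Output is a real number rather than a category."""
    return slope * feature(text) + intercept


threshold = fit_classifier(REVIEWS)
slope, intercept = fit_regressor(REVIEWS)
new_example = "It was spectacular!"
print(classify(new_example, threshold))
print(round(predict_score(new_example, slope, intercept), 2))
```

Both models are trained on the same labeled examples; only the type of the target (a category versus a number) changes what is learned and what is predicted.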


1.1.2 Unsupervised Learning 


Unsupervised learning determines categories from data where there are no labels 
present. These tasks can take the form of clustering, grouping similar items together, 
or similarity, defining how closely a pair of items is related. For example, imagine
we wanted to recommend a movie based on a person’s viewing habits. We could 
cluster users based on what they have watched and enjoyed, and evaluate whose 
viewing habits most match the person to whom we are recommending the movie. 
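
The movie recommendation idea can be sketched without any labels at all. The snippet below illustrates the similarity form of the task (not a full clustering algorithm): it compares viewing histories with cosine similarity and recommends what the closest user enjoyed. The user names, movie titles, and viewing histories are hypothetical, and cosine similarity is only one of several reasonable similarity measures.

```python
import math

MOVIES = ["Alien", "Blade Runner", "Amelie", "Notting Hill"]

# Hypothetical viewing histories: 1 means the user watched and enjoyed the movie.
HISTORY = {
    "ana":   [1, 1, 0, 0],
    "ben":   [0, 1, 1, 0],
    "carla": [0, 0, 1, 1],
    "dev":   [1, 0, 0, 0],  # the person we want to recommend a movie to
}


def cosine(u, v):
    """Cosine similarity between two viewing vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_user(target, history):
    """Find the other user whose viewing habits best match the target's."""
    others = [name for name in history if name != target]
    return max(others, key=lambda name: cosine(history[target], history[name]))


def recommend(target, history):
    """Recommend movies the most similar user enjoyed that the target has not seen."""
    neighbor = most_similar_user(target, history)
    return [
        movie
        for movie, seen, liked in zip(MOVIES, history[target], history[neighbor])
        if liked and not seen
    ]


print(most_similar_user("dev", HISTORY))
print(recommend("dev", HISTORY))
```

No example carries a label here; the structure comes entirely from how the viewing vectors relate to one another.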
1.1.3 Semi-Supervised Learning and Active Learning


In many situations it is not possible to label or annotate the entire dataset
because of cost, lack of expertise, or other constraints. Learning jointly from
the labeled and unlabeled data is called semi-supervised learning. If, instead of
relying on experts to decide what to label, the machine provides insight into which
data should be labeled, the process is called active learning.
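
One common way for the machine to suggest what to label next is uncertainty sampling, which we use here purely as an illustration; the text above does not prescribe a particular strategy. The sketch assumes a hypothetical, already-trained one-dimensional logistic model (the weight and bias values are made up) and asks an annotator to label the unlabeled example the model is least sure about.

```python
import math


def predict_proba(x, weight=1.5, bias=-3.0):
    """Hypothetical trained logistic model: estimated P(label = 1 | x)."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))


def most_uncertain(pool):
    """Uncertainty sampling: pick the example whose probability is closest to 0.5."""
    return min(pool, key=lambda x: abs(predict_proba(x) - 0.5))


# Unlabeled examples (made-up feature values) waiting for annotation.
UNLABELED_POOL = [0.2, 1.9, 3.5, 5.0]

query = most_uncertain(UNLABELED_POOL)
print(query, round(predict_proba(query), 3))
```

The model's decision boundary lies at x = 2, so the example nearest to it is selected; labeling that example is expected to be more informative than labeling one the model already classifies confidently.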


1.1.4 Transfer Learning and Multitask Learning 


The basic idea behind “transfer learning” is to help the model adapt to situations 
it has not previously encountered. This form of learning relies on tuning a general 
model to a new domain. Learning from many tasks to jointly improve the perfor- 
mance across all the tasks is called multitask learning. These techniques are becom- 
ing the focus in both deep learning and NLP/speech. 


1.1.5 Reinforcement Learning 


Reinforcement learning focuses on maximizing a reward given an action or set of 
actions taken. The algorithms are trained to encourage certain behaviors and discourage
others. Reinforcement learning tends to work well on games like chess or Go,
where the reward may be winning the game. In this case, a number of actions must 
be taken before the reward is reached. 
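
The delayed-reward setting can be illustrated with a minimal sketch of tabular Q-learning, one standard reinforcement learning algorithm. The environment is an assumption made for this example: a five-cell corridor in which the agent starts at the left end and receives a reward only when it enters the rightmost cell. The learning rate, discount factor, episode count, and purely random exploration are likewise illustrative choices.

```python
import random

N_STATES = 5                 # corridor cells 0..4; cell 4 is the goal
ACTIONS = ["left", "right"]
GOAL = N_STATES - 1


def step(state, action):
    """Move one cell; the only reward is 1.0 for entering the goal cell."""
    nxt = max(state - 1, 0) if action == "left" else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values Q[state][action] from randomly explored episodes."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = rng.choice(ACTIONS)       # explore at random (Q-learning is off-policy)
            nxt, reward = step(state, action)
            best_next = max(q[nxt].values())   # remains 0.0 for the terminal goal cell
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = nxt
    return q


q_table = train()
# Greedy policy: in each non-goal cell, take the action with the highest learned value.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

Although the reward arrives only at the final step, the discounted value updates propagate it backward, so the learned policy prefers moving right even in the cells farthest from the goal.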


1.2 History 


You don’t know where you’re going until you know where you’ve been.—James Baldwin 


It is impossible to separate the current approaches to natural language processing 
and speech from the extensive histories that accompany them. Many of the advance- 
ments discussed in this book are relatively new in comparison to those presented 
elsewhere, and, because of their novelty, it is important to understand how these 
ideas developed over time to put the current innovations into proper context. Here, 
we present a brief history of deep learning, natural language processing, and speech 
recognition. 




1.2.1 Deep Learning: A Brief History 


There has been much research in both the academic and industrial fields that has led 
to the current state of deep learning and its recent popularity. The goal of this section 
is to give a brief timeline of research that has influenced deep learning, although we 
might not have captured all the details (Fig. 1.6). Schmidhuber [Sch15] has compre- 
hensively captured the entire history of neural networks and various research that 
led to today’s deep learning. In the early 1940s, W. S. McCulloch and W. Pitts modeled
how the brain works using a simple electrical circuit called the threshold logic unit
that could simulate intelligent behavior [MP88]. They modeled the first neuron with
inputs and outputs that generated 0 when the “weighted sum” was below a threshold
and 1 otherwise. The weights were not learned but set by hand. They coined the term
connectionism to describe their model. Donald Hebb, in his book “The Organization
of Behavior” (1949), took the idea further by proposing how neural pathways can
have multiple neurons firing and strengthening over time with usage, thus laying the
foundation of complex processing [Heb49].
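
The threshold logic unit described above is simple enough to write down directly. The following is a minimal sketch: the unit outputs 0 when the weighted sum of its inputs is below the threshold and 1 otherwise. The particular weights and thresholds used to realize AND and OR are illustrative hand-set values, in keeping with the fact that nothing is learned in this model.

```python
def threshold_unit(inputs, weights, threshold):
    """Output 0 if the weighted sum is below the threshold, and 1 otherwise."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 0 if weighted_sum < threshold else 1


def logical_and(a, b):
    # Both inputs must be on for the sum to reach the threshold of 2.
    return threshold_unit([a, b], weights=[1, 1], threshold=2)


def logical_or(a, b):
    # A single active input is enough to reach the threshold of 1.
    return threshold_unit([a, b], weights=[1, 1], threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Changing only the threshold turns the same unit from an AND gate into an OR gate, which shows how much behavior is carried by the hand-chosen parameters.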

According to many, Alan Turing in his seminal paper “Computing Machinery 
and Intelligence” laid the foundations of artificial intelligence with several criteria 
to validate the “intelligence” of machines known as the “Turing test” [Tur95]. In 
1959, the discovery of simple cells and complex cells that constitute the primary 
visual cortex by Nobel Laureates Hubel and Wiesel had a wide-ranging influence in 
many fields including the design of neural networks. Frank Rosenblatt extended the
McCulloch-Pitts neuron into what he called the Mark I Perceptron, which took inputs,
generated outputs, and had linear thresholding logic [Ros58]. The weights in the
perceptron were “learned” by successively passing the inputs and reducing the difference
between the generated output and the desired output. Bernard Widrow and Marcian
Hoff took the idea of perceptrons further to develop Multiple ADAptive LINear
Elements (MADALINE), which were used to eliminate noise in phone lines [WH60].
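
The idea of learning weights by repeatedly reducing the gap between the generated and the desired output can be sketched with the classic perceptron update rule. This is a textbook formulation written for illustration, not a reconstruction of Rosenblatt's hardware; the AND task, the zero initialization, the learning rate, and the number of passes are all assumptions of the example.

```python
def predict(weights, bias, inputs):
    """Linear threshold logic: fire (1) when the weighted sum plus bias is non-negative."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total >= 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Pass over the inputs repeatedly, nudging weights by the output error."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # desired minus generated
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so the rule is guaranteed to converge on it.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
print(weights, bias)
print([predict(weights, bias, x) for x, _ in AND_SAMPLES])
```

When a prediction is already correct the error is zero and nothing changes, so learning stops on its own once every example is classified correctly.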

Marvin Minsky and Seymour Papert published the book Perceptrons which 
showed the limitations of perceptrons in learning the simple exclusive-or function 
(XOR) [MP69]. Because of the large number of iterations required to generate the
output and the limitations imposed by compute time, they argued that multilayer
networks of perceptrons were impractical. Funding dried up for years because of
this, effectively limiting research in neural networks during a period appropriately
called “The First AI Winter.”

In 1986, David Rumelhart, Geoff Hinton, and Ronald Williams published the 
seminal work “Learning representations by back-propagating errors” which showed 
how a multi-layered neural network could not only be trained effectively using a 
relatively simple procedure but how “hidden” layers can be used to overcome the 
weakness of perceptrons in learning complex patterns [RHW88]. Though there was
much earlier work in the form of various theses and papers, the works of
S. Linnainmaa, P. Werbos, Fukushima, David Parker, Yann LeCun, and Rumelhart
et al. considerably broadened the popularity of neural networks [Lin70, Wer74,
Fuk79, Par85, LeC85].
Fig. 1.6: Highlights in deep learning research


LeCun et al. with their research and implementation led to the first widespread 
application of neural networks to the recognition of handwritten digits used by the 
U.S. Postal Service [LeC+89]. This work was an important milestone in deep learn- 
ing history, as it showed how convolution operations and weight sharing could be 
effective for learning features in modern convolutional neural networks (CNNs). 
George Cybenko showed how feed-forward networks with a finite number of neurons, a
single hidden layer, and a non-linear sigmoid activation function could approximate most
complex functions under mild assumptions [Cyb89]. Cybenko’s research along with
Kurt Hornik’s work led to the further rise of neural networks and their application as
“universal function approximators” [Hor91]. The seminal work of Yann LeCun et
al. resulted in widespread practical applications of CNNs such as reading bank
checks [LB94, LBB97].

Dimensionality reduction and learning using unsupervised techniques were 
demonstrated in Kohonen’s work titled “Self-Organized Formation of Topologically
Correct Feature Maps” [Koh82]. John Hopfield with his Hopfield Networks
created one of the first recurrent neural networks (RNNs) that served as a
content-addressable memory system [Hop82]. Ackley et al. in their research showed how
Boltzmann machines modeled as neural networks could capture probability
distributions using the concepts of particle energy and thermodynamic temperature
applied to the networks [AHS88]. Hinton and Zemel in their work presented
various unsupervised techniques to approximate probability distributions
using neural networks [HZ94]. Radford Neal’s work on the “belief net,” similar
to Boltzmann machines, showed how it could be used to perform unsupervised
learning using much faster algorithms [Nea95].
Christopher Watkins’ thesis introduced “Q Learning” and laid the foundations
for reinforcement learning [Wat89]. Dean Pomerleau in his work at CMU’s NavLab 
showed how neural networks could be used in robotics using supervised techniques 
and sensor data from various sources such as steering wheels [Pom89]. Lin’s thesis 
showed how robots could be taught effectively using reinforcement learning tech- 
niques [Lin92]. One of the most significant landmarks in neural networks history 
is when a neural network was shown to outperform humans in a relatively complex 
task such as playing Backgammon [Tes95]. The first very deep learning network 
that used the concepts of unsupervised pre-training for a stack of recurrent neu- 
ral networks to solve the credit assignment problem was presented by Schmidhu- 
ber [Sch92, Sch93]. 

Sebastian Thrun’s paper “Learning To Play the Game of Chess” showed the 
shortcomings of reinforcement learning and neural networks in playing a complex 
game like Chess [Thr94]. Schraudolph et al. in their research further highlighted 
the issues of neural networks in playing the game Go [SDS93]. Backpropagation, 
which led to the resurgence of neural networks, was soon considered a problem due 
to issues such as vanishing gradients, exploding gradients, and the inability to learn 
long-term information, to name a few [Hoc98, BSF94]. Similar to how CNN archi- 
tectures improved neural networks with convolution and weight sharing, the “long 
short-term memory (LSTM)” architecture introduced by Hochreiter and Schmidhu- 
ber overcame issues with long-term dependencies during backpropagation [HS97]. 
At the same time, statistical learning theory and particularly support vector
machines (SVMs) were fast becoming popular across a wide variety of
problems [CV95]. These changes contributed to “The Second Winter of AI.”

Many in the deep learning community normally credit the Canadian Institute for 
Advanced Research (CIFAR) for playing a key role in advancing what we know as 
deep learning today. Hinton et al. published a breakthrough paper in 2006 titled “A 
Fast Learning Algorithm for Deep Belief Nets” which led to the resurgence of deep 
learning [HOT06a]. The paper not only presented the name deep learning for the
first time but showed the effectiveness of layer-by-layer training using unsupervised 
methods followed by supervised “fine-tuning” in achieving the state-of-the-art re- 
sults on the MNIST character recognition dataset. Bengio et al. published another 
seminal work following this, which gave insights into why deep learning networks 
with multiple layers can hierarchically learn features as compared to shallow neural 
networks or support vector machines [Ben+06]. The paper gave insights into why 
pre-training with unsupervised methods using DBNs, RBMs, and autoencoders not 
only initialized the weights to achieve optimal solutions but also provided good rep- 
resentations of data that can be learned. Bengio and LeCun’s paper “Scaling Algo- 
rithms Towards AI” reiterated the advantages of deep learning through architectures 
such as CNN, RBM, DBN, and techniques such as unsupervised pre-training/fine- 
tuning inspiring the next wave of deep learning [BL07]. Using non-linear activation 
functions such as rectified linear units overcame many of the issues with the
backpropagation algorithm [NH10, GBB11]. Fei-Fei Li, head of the artificial intelligence
lab at Stanford University, along with other researchers launched ImageNet, which
collected a large number of images and showed the usefulness of data in important
tasks such as object recognition, classification, and clustering [Den+09].

At the same time, following Moore’s law, computers were getting faster, and 
graphics processing units (GPUs) overcame many of the previous limitations of CPUs.
Mohamed et al. showed a huge improvement in the performance of a complex task 
such as speech recognition using deep learning techniques and achieved huge speed 
increases on large datasets with GPUs [Moh+11]. Using the previous networks such 
as CNNs and combining them with a ReLU activation, regularization techniques, 
such as dropout, and the speed of the GPU, Krizhevsky et al. attained the smallest 
error rates on the ImageNet classification task [KSH12]. Winning the ILSVRC-2012
competition by a wide margin, with a CNN-based deep learning error rate
of 15.3% against 26.2% for the second best, drew the attention of both academia and
industry to deep learning. Goodfellow et al. proposed a generative network using
adversarial methods that addressed many issues of learning in an unsupervised
manner and is considered path-breaking research with wide applications [Goo+14].

Many companies such as Google, Facebook, and Microsoft started replacing their 
traditional algorithms with deep learning using GPU-based architectures for speed. 
Facebook’s DeepFace uses deep networks with more than 120 million parameters
and achieves an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset,
approaching human-level accuracy by improving on the previous results by an
unprecedented 27% [Tai+14]. Google Brain, a collaboration between Andrew Ng and Jeff
Dean, resulted in large-scale deep unsupervised learning from YouTube videos for
tasks such as object identification using 16,000 CPU cores and close to one
billion weights. DeepMind’s AlphaGo beat Lee Sedol of Korea, an internationally
top-ranked Go player, marking an important milestone in overall AI and deep
learning.


1.2.2 Natural Language Processing: A Brief History 


Natural language processing (NLP) is an exciting field of computer science that 
deals with human communication. It encompasses approaches to help machines 
understand, interpret, and generate human language. These are sometimes delin- 
eated as natural language understanding (NLU) and natural language generation 
(NLG) methods. The richness and complexity of human language should not be
underestimated. At the same time, the need for algorithms that can comprehend language
is ever growing, and natural language processing exists to fill this gap. Traditional
NLP methods take a linguistics-based approach, building up from base semantic
and syntactic elements of a language, such as part-of-speech. Modern deep learning
approaches can sidestep the need for intermediate elements and may learn their own
hierarchical representations for generalized tasks.

As with deep learning, in this section we will try to summarize some impor- 
tant events that have shaped natural language processing as we know it today. 
We will give a brief overview of important events that impacted the field up until
2000 (Fig. 1.7). For a very comprehensive summary, we refer the reader to the
well-documented outline in Karen Spärck Jones’s survey [Jon94]. Since neural architectures and
deep learning have, in general, had much impact in this area and are the focus of the 
book, we will cover these topics in more detail. 

Though there were traces of interesting experiments in the 1940s, the IBM- 
Georgetown experiment of 1954 that showcased machine translation of around 
60 sentences from Russian to English can be considered an important
milestone [HDG55]. Though constrained by computing resources in the form of
software and hardware, some of the challenges of syntactic, semantic, and linguistic 
variety were discovered, and an attempt was made to address them. Similar to 
how AI was experiencing its golden age, many developments took place between
1954 and 1966, such as the establishment of conferences including the Dartmouth
Conference in 1956, the Washington International Conference in 1958, and the 
Teddington International Conference on Machine Translation of Languages and 
Applied Language Analysis in 1961. In the Dartmouth Conference of 1956, John 
McCarthy coined the term “artificial intelligence.” In 1957, Noam Chomsky pub- 
lished his book Syntactic Structures, which highlighted the importance of sentence 
syntax in language understanding [Cho57]. The invention of the phrase-structure 
grammar also played an important role in that era. Most notably, software such as
LISP by John McCarthy in 1958 and ELIZA (the first chatbot), an early attempt at the
Turing test, had a great influence not only on NLP but on the entire field of AI.
Fig. 1.7: Highlights in natural language processing research

In 1964, the United States National Research Council (NRC) set up a group 
known as the Automatic Language Processing Advisory Committee (ALPAC) to 
evaluate the progress of NLP research. The ALPAC report of 1966 highlighted the
difficulties surrounding machine translation, from the process itself to the cost of
implementation, and was influential in reducing funding, nearly putting a halt to NLP
research [PC66]. This phase, spanning the 1960s and 1970s, was a period in the study of world
knowledge that emphasized semantics over syntactical structures. Grammars such as
case grammar, which explored relations between nouns and verbs, played an
interesting role in this era. Augmented transition networks were another search algorithm
approach for solving problems like the best syntax for a phrase. Schank’s concep- 
tual dependency, which expressed language in terms of semantic primitives without 
syntactical processing, was also a significant development [ST69]. SHRDLU was a 
simple system that could understand basic questions and answer in natural language 
using syntax, semantics, and reasoning. LUNAR by Woods et al. was the first of 
its kind: a question-answering system that combined natural language understand- 
ing with a logic-based system. Semantic networks, which capture knowledge as a 
graph, became an increasingly common theme highlighted in the work of Silvio 
Ceccato, Margaret Masterman, Quillian, Bobrow and Collins, and Findler, to name 
a few [Cec61, Mas61, Qui63, BC75, Fin79]. In the early 1980s, the grammatico- 
logical phase began where the linguists developed different grammar structures and 
started associating meaning in phrases concerning users’ intention. Many tools and 
software such as Alvey natural language tools, SYSTRAN, METEO, etc. became 
popular for parsing, translation, and information retrieval [Bri+87, HS92]. 

The 1990s were an era of statistical language processing in which many new ideas
for gathering data, such as using a corpus for linguistic processing or understanding
words based on their occurrence and co-occurrence using probabilistic approaches,
were used in most NLP-based systems [MMS99]. A large amount of data
available through the World Wide Web across different languages created a high de- 
mand for research in areas such as information retrieval, machine translation, sum- 
marization, topic modeling, and classification [Man99]. An increase in memory and 
processing speed in computers made it possible for many real-world applications 
to start using text and speech processing systems. Linguistic resources, including 
annotated collections such as the Penn Treebank, British National Corpus, Prague 
Dependency Treebank, and WordNet, were beneficial for academic research and 
commercial applications [Mar+94, HKKS99, Mil95]. Classical approaches such as 
n-grams and a bag-of-words representation with machine learning algorithms such 
as multinomial logistic regression, support vector machines, Bayesian networks, or 
expectation-maximization were common supervised and unsupervised techniques
for many NLP tasks [Bro+92, MMS99]. Baker et al. introduced the FrameNet 
project which looked at “frames” to capture semantics such as entities and rela- 
tionships and this led to semantic role labeling, which is an active research topic 
today [BFL98]. 

In the early 2000s, the Conference on Natural Language Learning (CoNLL) 
shared tasks resulted in much interesting NLP research in areas such as chunking,
named entity recognition, and dependency parsing, to name a few [TKSB00,
TKSDM03a, BM06]. Lafferty et al. proposed conditional random fields (CRF),
which have become a core part of most state-of-the-art frameworks in sequence
labeling where there are interdependencies between the labels [LMP01].

Bengio et al. in the early 2000s proposed the first neural language model, which
mapped the n previous words through a lookup table, fed the result into a feed-forward
network with a hidden layer, and passed the output through a softmax
layer to predict the next word [BDV00]. Bengio’s research was the first usage of the
“dense vector representation” instead of the “one-hot vector” or bag-of-words model
in NLP history. Many language models based on recurrent neural networks and
long short-term memory, which were proposed later, have become the state of the
art [Mik+10b, Gra13]. Papineni et al. proposed the bilingual evaluation understudy
(BLEU) metric, which is used even today as a standard metric for machine
translation [Pap+02]. Pang et al. introduced sentiment classification, which is now one
of the most popular and widely studied NLP tasks [PLV02]. Hovy et al. introduced 
OntoNotes, a large multilingual corpus with multiple annotations used in a wide 
variety of tasks such as dependency parsing and coreference resolution [Hov+06a]. 
The distant supervision technique, by which existing knowledge is used to generate 
patterns that can be used to extract examples from large corpora, was proposed by 
Mintz et al. and is used in a variety of tasks such as relation extraction, information 
extraction, and sentiment analysis [Min+09]. 
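
The feed-forward neural language model mentioned at the start of this paragraph
can be sketched as a single forward pass: look up a dense vector for each of the
previous words, concatenate them, apply a hidden layer, and normalize the output
scores with a softmax. The toy vocabulary, the layer sizes, the random untrained
weights, and all function names below are assumptions for illustration only, and
biases are omitted for brevity. It is not the exact architecture of [BDV00].

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


VOCAB = ["the", "cat", "sat", "on", "mat"]
EMBED_DIM, CONTEXT, HIDDEN = 3, 2, 4

rng = random.Random(0)
embeddings = random_matrix(len(VOCAB), EMBED_DIM, rng)      # lookup table
W_hidden = random_matrix(HIDDEN, CONTEXT * EMBED_DIM, rng)  # hidden layer
W_output = random_matrix(len(VOCAB), HIDDEN, rng)           # output layer


def next_word_distribution(context_words):
    """Return P(next word | previous CONTEXT words) over the vocabulary."""
    ids = [VOCAB.index(w) for w in context_words]
    # Concatenate the dense vectors of the context words.
    x = [value for i in ids for value in embeddings[i]]
    hidden = [math.tanh(h) for h in matvec(W_hidden, x)]
    return softmax(matvec(W_output, hidden))


probs = next_word_distribution(["the", "cat"])
```

Because the weights are untrained, the resulting distribution is close to uniform;
the point of the sketch is only the data flow from lookup table to softmax.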

The research paper by Collobert and Weston was instrumental not only in high- 
lighting ideas such as pre-trained word embeddings and convolutional neural net- 
works for text but also in sharing the lookup table or the embedding matrix for 
multitask learning [CW08]. Multitask learning can learn multiple tasks at the same
time and has recently become one of the core research areas in NLP.
Mikolov et al. improved the efficiency of training the word embeddings proposed 
by Bengio et al. by removing the hidden layer and having an approximate objective 
for learning that gave rise to “word2vec,” an efficient large-scale implementation 
of the embeddings [Mik+13a, Mik+13b]. Word2vec has two implementations: (a) 
continuous bag-of-words (CBOW), which predicts the center word given the nearby 
words, and (b) skip-gram, which does the opposite and predicts the nearby words. 
The efficiency gained from learning on a large corpus of data enabled these dense 
representations to capture various semantics and relationships. Word embeddings 
used as representations and pre-training of these embeddings on a large corpus for 
any neural-based architecture are standard practice today. Recently many extensions 
to word embeddings, such as projecting word embeddings from different languages 
into the same space and thus enabling “transfer learning” in an unsupervised manner 
for various tasks such as machine translation, have gained much interest [Con+17].
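
The difference between the CBOW and skip-gram formulations described above is
easiest to see in the training examples each one draws from a sentence. The
following is a minimal sketch of that extraction step only; the function names and
the window parameter are illustrative assumptions and do not correspond to the
actual word2vec implementation, which also involves the approximate training
objective.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word is paired with every nearby word to predict."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the nearby words together form the input that predicts the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


sentence = ["deep", "learning", "for", "speech"]
sg = skipgram_pairs(sentence, window=1)
cb = cbow_examples(sentence, window=1)
```

With a window of 1, the word "learning" yields the skip-gram pairs
("learning", "deep") and ("learning", "for"), while CBOW produces the single
example (["deep", "for"], "learning").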

Sutskever’s Ph.D. thesis which introduced the Hessian-free optimizer to train 
recurrent neural networks efficiently on long-term dependencies was a milestone in 
reviving the usage of RNNs especially in NLP [Sut13]. Usage of convolutional neu- 
ral networks on text surged greatly after advances made by Kalchbrenner et al. and 
Kim et al. [KGB14, Kim14]. CNNs are now widely used across many NLP tasks
because of their dependency on the local context through the convolution operation,
which makes them highly parallelizable. Recursive neural networks, which provide a recursive
hierarchical structure to the sentences and are inspired by linguistic approaches,
became another important neural architecture in the neural-based NLP world [LSM13].

Sutskever et al. proposed sequence-to-sequence learning as a general neural 
framework composed of an encoder neural network processing inputs as a sequence 
and a decoder neural network predicting the outputs based on the input sequence 
states and the current output states [SVL14]. This framework has found a wide 
range of applications such as constituency parsing, named entity recognition, ma- 
chine translation, question-answering, and summarization. Google started replacing 
its monolithic phrase-based machine translation models with sequence-to-sequence 
neural MT models [Wu+16]. Character-based rather than word-based
representations overcome many issues such as out-of-vocabulary words and have been part of
research in deep learning based systems for various NLP tasks [Lam+16, PSG16]. The
attention mechanism by Bahdanau et al. is another innovation that has been widely
popular in different neural architectures for NLP and speech [BCB14b]. Memory
augmented networks with various variants such as memory networks, neural Tur- 
ing machines, end-to-end memory networks, dynamic memory networks, differ- 
entiable neural computers, and recurrent entity networks have become very pop- 
ular in the last few years for complex natural language understanding and language 
modeling tasks [WCB14, Suk+15, GWD14, Gra+16, Kum+16, Gre+15, Hen+16]. 
Adversarial learning and using adversarial examples have recently become com- 
mon for distribution understanding, testing the robustness of models, and transfer 
learning [JL17, Gan+16]. Reinforcement learning is another emerging field in deep
learning and has applications in NLP, specifically in areas where there are
temporal dependencies and non-differentiable optimization zones where gradient-based
methods fail. Modeling dialog systems, machine translation, text summarization,
and visual storytelling, among others, have seen the benefits of reinforcement learning
techniques [Liu+18, Ran+15, Wan+18, PXS17].


1.2.3 Automatic Speech Recognition: A Brief History 


Automatic speech recognition (ASR) is quickly becoming a mainstay in
human-computer interaction. Most of the tools used today have an option for speech
recognition for various types of dictation tasks, whether it is composing a text message,
playing music through a home-connected device, or even text-to-speech applica- 
tions with virtual assistants. Although many of the techniques have recently gained 
popularity, research and development of ASR began in the middle of the twentieth 
century (Fig. 1.8). 

The earliest research in ASR began in the 1950s. In 1952, Bell Laboratories 
created a system to recognize the pronunciation of isolated digits from a single 
speaker, using formant frequencies (frequencies that correlate to human speech for 
certain sounds) from the speech power spectrum. Many research universities built 
systems to recognize specific syllables and vowels for a single talker [JR05b].
In the 1960s, small vocabulary and acoustic phonetic-based tasks became prime
research areas, leading to many techniques centered around dynamic programming 
and frequency analysis. IBM’s Shoebox was able to recognize not only digits but 
also words such as “sum” and “total” and use these in arithmetic
computations to give results. Researchers at University College in England could analyze
phonemes for recognition of vowels and consonants [JR05a].

In the 1970s, research moved towards medium-sized vocabulary tasks and con- 
tinuous speech. The dominant techniques were various types of pattern recognition 
and clustering algorithms. Dynamic time warping (DTW) was introduced to han- 
dle time variability, aligning an input sequence of features to an output sequence 
of classes. “Harpy,” a speech recognizer from Carnegie Mellon University, was ca- 
pable of recognizing speech with a vocabulary of 1011 words. One of the main 
achievements of this work was the introduction of the graph search to “decode”
lexical representations of words with a set of rules and a finite state network [LR90].
The methods that would optimize this capability, however, were not introduced
until the 1990s. A recognition system called Tangora [JBM75] was created by IBM to
provide a “voice-activated typewriter.” This effort introduced a focus on large
vocabularies and the sequence of words for grammars, which led to the introduction
of language models for speech. During this era, AT&T also played a significant role 
in ASR, focusing heavily on speaker-independent systems. Their work, therefore, 
focused more heavily on what is called the acoustic model, dealing with the anal- 
ysis of the speech patterns across speakers. By the mid-late 1970s, hidden Markov 
models were used to model spectral variations for discrete speech. 
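
Dynamic time warping, mentioned above as the way to handle time variability, can
be sketched with a short dynamic program. The version below is a simplified
illustration that compares two sequences of scalar features using absolute
difference as the local cost; the function name, the cost choice, and the toy
sequences are assumptions, and real systems of that era compared frames of
spectral features against stored word templates.

```python
def dtw_distance(a, b):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


template = [1.0, 2.0, 3.0]
slow_utterance = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]  # same shape, spoken slower
shifted = [2.0, 3.0, 4.0]                        # different content

same_word_cost = dtw_distance(template, slow_utterance)
other_word_cost = dtw_distance(template, shifted)
```

The slower utterance aligns to the template at zero cost even though it is twice
as long, which is the time-variability property that made the technique useful.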

In the 1980s, the fundamental approach of ASR shifted to a statistical foundation, 
specifically HMM methods for modeling transitions between states. By the
mid-1980s, HMMs had become the dominant technique for ASR (and remain one of
the most prominent today). This shift to HMMs allowed many other advancements
such as speech decoding frameworks with finite state transducers (FSTs). The 1980s also saw the introduction
of neural networks for speech recognition. Their ability to approximate any func- 
tion made them an exciting candidate for predicting the state transitions while still 
relying on the HMM to handle the temporal nature of continuous speech. Various 
toolkits were created during this period to support ASR, such as Sphinx [Lee88] and 
DECIPHER [Mur+89] from SRI. 

In the 1990s, many advancements in machine learning were incorporated into 
ASR, which led to improved accuracy. Many of these advancements reached
commercially available software, such as Dragon, which had a dictionary of 80,000 words
and the ability to train the software to the user’s voice. Many toolkits were created
to support ASR in the late 1980s and 1990s, such as HTK [GW01] from Cambridge,
a hidden Markov model toolkit.

The time delay neural network (TDNN) [Wai+90] was one of the earliest appli- 
cations of deep learning to speech recognition. It utilized stacked 2D convolutional 
layers to perform phone classification. The benefit of this approach was that it was
shift-invariant (not requiring a segmentation); however, the width of the network
limited the context window. The TDNN approach was comparable to early HMM-based
approaches; however, it did not integrate with HMMs and was difficult to use
in large vocabulary settings [YD14].
In the 2000s, the focus on machine learning advancements continued. In
[MDH09] deep belief networks were applied to phone recognition, achieving
state-of-the-art performance on the TIMIT corpus.¹ These networks learn
unsupervised features for better acoustic robustness. In [Dah+12] a hybrid DNN
and context-dependent (CD) hidden Markov model was introduced that extended 
the advancements of the DNN and achieved substantial improvements for large 
vocabulary speech recognition. Deep neural networks continued to advance the 
state-of-the-art during the 2000s, and the DNN/HMM hybrid model became the 
dominant approach. 

Since 2012, deep learning has been applied to the sequence portion of the ASR 
task, replacing the HMM for many of the techniques, moving towards end-to-end 
models for speech recognition. With their introduction, many of the modern
methods have been making their way into ASR, such as attention [Cho+15, KHW17]
and RNN transducers [MPR08]. The incorporation of sequence-to-sequence
architectures with larger datasets allows the models to learn the acoustic and linguistic
dependencies directly from the data, leading to higher quality. 

End-to-end research has continued to develop in recent years, focusing on im- 
proving some of the difficulties that arise from end-to-end models; however, hy- 
brid architectures tend to remain more popular in production, due to the usefulness 
of lexicon models in decoding. For a more detailed survey of the history of ASR, 
[TG01] is recommended.


¹ http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1.
1.3 Tools, Libraries, Datasets, and Resources for the Practitioners


There are a myriad of open-source resources available to the reader interested in 
building NLP, deep learning, or speech analytics. In the section below, we provide a 
list of the more popular libraries and datasets. This is in no way an exhaustive list, 
as our goal is to familiarize the reader with the wide range of available frameworks 
and resources. 


1.3.1 Deep Learning 


As with the techniques in NLP, deep learning frameworks have also seen tremendous 
progress in recent years. Numerous frameworks exist, each with its own
specialization. The most popular deep learning frameworks are: TensorFlow, PyTorch, Keras,
MXNet, CNTK, Chainer, Caffe2, PaddlePaddle, and Matlab.² The main component
of a modern deep learning framework is efficient linear algebra,
with support for CPU and GPU computation (dedicated
hardware like TPUs [Jou16] is becoming increasingly popular as well). All
relevant Python frameworks encompass both CPU and GPU support. The
implementation differences tend to focus on the trade-offs between the intended end users
(researcher vs. engineer vs. data scientist).

We look at the top deep learning frameworks from the ones mentioned previously
and give a brief description of them. In order to do this, we compare the Google
Trends for each of the frameworks and focus on the top 3.³ As shown in Fig. 1.9, the
top framework (worldwide) is Keras, followed by TensorFlow, and then PyTorch.
Additional information about the frameworks used in the case studies will be given
throughout the book.

[Figure legend: TensorFlow, PyTorch, Keras, Theano, MXNet]

Fig. 1.9: Google Trends for deep learning frameworks worldwide

² Theano exists as another popular framework; however, major development has been discontinued given
the popularity of more recent frameworks. It is therefore not included in this book.

³ Although this is a single, statistically insignificant, data point, the Google Trends mechanism
is useful and roughly correlates with other evaluations such as number of contributors, GitHub
popularity, number of articles written, and books written for the various frameworks.
The following are popular open-source frameworks for building neural networks:

• TensorFlow: TensorFlow is a computational library based on data flow graphs.
These graphs have nodes representing mathematical operations and edges
representing tensors that flow between them. Used primarily through its Python API,
TensorFlow was developed by the Google Brain team.

• Keras: Keras is a simple, high-level Python library developed to enable fast
prototyping and experimentation. It can run on top of TensorFlow and CNTK,
and is now part of the TensorFlow core library. Keras contains implementations
of common neural network components and numerous architecture examples.

• PyTorch: PyTorch is a Python package for rapid prototyping of neural networks.
It is based on Torch, an extremely fast computational framework, and provides
dynamic graph computation at runtime. PyTorch was developed by the Facebook
AI research team.

• Caffe: Caffe is a high-performance C++ framework for building deep learning
architectures that can natively support distributed and multi-GPU execution. The
current version, Caffe2, is the backend used by Facebook in production.

• CNTK: Renamed the Microsoft Cognitive Toolkit, CNTK is a computational
framework based on directed graphs. It supports the Python, C#, and C++
languages and was developed by Microsoft Research.

• MXNet: MXNet is a high-performance computational framework written in
C++ with native GPU support, incubated by the Apache Project.

• Chainer: Chainer is a pure Python-based framework with dynamic computational
graph capability defined at runtime.


1.3.2 Natural Language Processing 


The following resources are some of the more popular open-source toolkits for nat- 
ural language processing: 


• Stanford CoreNLP: A Java-based toolkit of linguistic analysis tools for processing
natural language text. CoreNLP was developed by Stanford University.

• NLTK: The Natural Language Toolkit, or NLTK for short, is an open-source
suite of libraries for symbolic and statistical natural language processing for
English. It was developed by the University of Pennsylvania.

• Gensim: A Python-based, open-source toolkit that focuses on vector space and
topic modeling of text documents.

• spaCy: A high-performance Python-based toolkit for advanced natural language
processing. spaCy is open-source and supported by Explosion AI.

• OpenNLP: An open-source machine learning toolkit for processing text in
natural language. OpenNLP is sponsored by the Apache Project.

• AllenNLP: An NLP research library built in PyTorch.


1.3.3 Speech Recognition 


The following resources are some of the more popular open-source toolkits for
speech recognition.⁴


1.3.3.1 Frameworks 


• Sphinx: An ASR toolkit developed by Carnegie Mellon University, focused on
production and application development.

• Kaldi: An open-source C++ ASR framework, built for research-focused speech
processing as well as for professional use.

• ESPnet: An end-to-end, deep learning-based ASR framework, inspired by Kaldi
and written with PyTorch and Chainer backends.


1.3.3.2 Audio Processing 


• SoX: An audio manipulation toolkit and library. It implements many file formats
and is useful for playing, converting, and manipulating audio files [NB18].

• LibROSA: A Python package for audio analysis, commonly used for feature
extraction and digital signal processing (DSP) [McF+15].


1.3.3.3 Additional Tools and Libraries 


• KenLM: A high-performance, n-gram language modeling toolkit commonly
integrated with ASR frameworks.

• LIME: Local interpretable model-agnostic explanations (LIME), a local,
model-agnostic explainer for machine learning and deep learning models.


4 Code repositories containing specific implementations that do not provide a full framework may
be used, but are not included on this list.

1.3.4 Books


The fields of NLP and machine learning are very broad and cannot be covered in
a single resource. Here we recognize various books that provide deeper
explanations as supplementary reading. The Elements of Statistical Learning
by Hastie et al. gives a solid grounding in machine learning and statistical
techniques [HTF01]. Learning From Data by Abu-Mostafa et al. also provides insights
into theories in machine learning in a more simplified and understandable manner
[AMMIL12]. Deep Learning (Adaptive Computation and Machine Learning series)
[GBC16] is an advanced, theory-focused book. It is widely considered the
foundational book for deep learning, moving from the fundamentals of deep learning
(linear algebra, probability theory, and numerical computation) to an exploration
of many architecture implementations and approaches. Foundations of Statistical
Natural Language Processing [MS99] is a comprehensive resource on statistical
models for natural language processing, providing in-depth mathematical
foundations for implementing NLP tools. Speech and Language Processing [Jur00]
provides an introduction to NLP and speech, with both breadth and depth in many
areas of statistical NLP. More recent editions explore neural network applications.

Neural Network Methods in Natural Language Processing (Synthesis Lectures
on Human Language Technologies) [Goll17] presents neural network applications
on language data, moving from the introduction of machine learning and neural
networks to specialized neural architectures for NLP applications. Automatic Speech
Recognition: A Deep Learning Approach by Yu et al. gives a thorough introduction
to ASR and deep learning techniques [YD15].


1.3.5 Online Courses and Resources 


Below we list some of the online courses in which experts in the field teach topics
related to deep learning, NLP, and speech. We have found them extremely beneficial.


• Natural Language Processing with Deep Learning
http://web.stanford.edu/class/cs224n/

• Deep Learning for Natural Language Processing
http://www.cs.ox.ac.uk/teaching/courses/2016-2017/dl

• Neural Networks for NLP
http://phontron.com/class/nn4nlp2017/schedule.html

• Deep Learning Specialization
https://www.deeplearning.ai/deep-learning-specialization/

• Deep Learning Summer School
https://vectorinstitute.ai/2018/11/07/vector-institute-deep-learning-and-reinforcement-learning-2018-summer-school/

• Convolutional Neural Networks for Visual Recognition
http://cs231n.stanford.edu/

• Neural Networks for Machine Learning
http://www.cs.toronto.edu/~hinton/coursera_lectures.html

• Neural Networks
http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html

• Practical Deep Learning For Coders
https://course.fast.ai/

• Intro to Deep Learning with PyTorch
https://www.udacity.com/course/deep-learning-pytorch--ud188


1.3.6 Datasets 


Any end-to-end deep learning application is reliant on data. While most
organizations rely on data collection as a fundamental part of their strategy, there are many
publicly available datasets for researchers, hobbyists, and practitioners.

Kaggle⁶ is one of the most popular sources for datasets and competitions for
machine learning and data science. With thousands of datasets and competitions
available, an active community of over one million registered users, and funded
competitions through its platform, Kaggle serves as a strong starting point for not
only datasets but also techniques for many tasks.

The Linguistic Data Consortium⁷ combines and sells datasets from a variety of
universities, corporations, and research labs. It primarily focuses on linguistic data
and language resources, with close to 1000 datasets.

The Stanford NLP group also releases a number of datasets for NLP, specifically for
the training of their CoreNLP models.⁸


• Text Similarity:

Dataset | Description
SentEval [CK18] | Evaluation library for sentence embeddings, comparing their effectiveness on 17 downstream tasks
Quora Question Pairs [ZCZ] | Collection of 400,000 question pairs of potentially duplicate questions from the Quora website with the goal of identifying duplicates


5 Additional text tasks and datasets are captured with associated papers at https://nlpprogress.com/.

6 https://www.kaggle.com.

7 https://www.ldc.upenn.edu/.

8 https://nlp.stanford.edu/data/.
• Text Clustering and Classification:

Dataset | Description
Reuters-21,578 [Zdr+18] | A collection of 21,578 articles that appeared on the Reuters newswire in 1987
Open ANC [MIG02] | Approximately 15 million spoken and written words of American English from a variety of sources annotated for syntactic structure
MASC [Ide+10] | A subset of approximately 500,000 words of contemporary American English drawn from the Open American National Corpus annotated for syntactic, semantic, and discourse structure

• Dependency Parsing:

Dataset | Description
Penn Treebank | A corpus of part-of-speech annotated American English, consisting of 4.5 million words

• Entity Extraction:

Dataset | Description
CoNLL 2003 [TKSDMO03b] | Newswire text tagged with various types of entities, such as location, organization, person, etc.
WNUT2017 [Der+17] | A dataset composed of annotated tweets, YouTube comments, and other internet sources, tagged with different entities
OntoNotes [Hov+06b] | A multilingual corpus with many labels, such as part-of-speech, parse trees, and entities (two million tokens in version 5)

• Relation Extraction:

Dataset | Description
NYT Corpus [RYM10] | New York Times corpus with relational labels for distantly related entities
SemEval-2010 (Task 8) [Hen+09] | A semantic relationship dataset, containing types of relationships between entity pairs, such as entity-origin and cause-effect

• Semantic Role Labeling:

Dataset | Description
OntoNotes [Pra+13] | A dataset of 1.7 million words focused on modeling the role of an entity in text
• Machine Translation:

Dataset | Description
Tatoeba | A collection of multilingual sentence pairs from the Tatoeba website https://tatoeba.org
 | An English-French (and additional English-German) dataset consisting of sentences taken from many sources, such as Common Crawl, UN corpus, and News Commentary
 | Crowdsourced descriptions of images in multiple languages





• Text Summarization:

Dataset | Description
CNN/Daily Mail [Nal+16] | Approximately 300,000 news stories from CNN and Daily Mail websites with corresponding summary bullet points
Cornell Newsroom [GNA18] | Over 1.3 million news articles and summaries from 38 major publications between 1998 and 2017
Google Dataset [FA13] | A sentence-compression task with 200,000 examples that focuses on deleting words to produce a compressed structure of the original, longer sentence
DUC | A smaller sentence summarization task with newswire and document data, containing 500 documents
Webis-TLDR-17 Corpus [Sye+18] | A dataset containing Reddit posts with the "too long; didn't read (TL;DR)" summaries for over three million posts





• Question Answering:

Dataset | Description
bAbI [Wes+15] | A collection of tasks and associated datasets to evaluate NLP models, specifically geared towards Q&A
NewsQA [Tri+16] | A set of 100,000 challenging question-answer pairs for news articles from CNN
SearchQA [Dun+17] | A general question-answering set for search QA
• Speech Recognition:

Dataset | Description
AN4 [Ace90] | A small, alphanumeric dataset containing randomly generated words, numbers, and letters
WSJ [PB92] | A general purpose, large vocabulary, speech recognition dataset, consisting of 400 h of spoken audio and transcripts
LibriSpeech [Pan+15] | A 1000 h dataset containing read speech from audiobooks from the LibriVox audiobook project
Switchboard [GHM92] | Conversational dataset, consisting of more than 240 h of audio. The combined test with CallHome English set is referred to as Hub5'00
TED-LIUM [RDE12] | 452 h of TED talks with corresponding transcripts and alignment information. The most recent version (3) [Her+18] has twice as much data as the previous versions
CHiME [Vin+16] | A speech challenge that has consisted of various tasks and datasets throughout the years. Some of the tasks include speech separation, recognition, speech processing in noisy environments, and multi-channel recognition
TIMIT [Gar+93] | An ASR dataset for phonetic studies with 630 speakers reading phonetically rich sentences





1.4 Case Studies and Implementation Details 


In this book, the goal is not only to inform, but also to enable the reader to practice
what is learned. In each of the following chapters a case study explores chapter
concepts in detail and provides a chance for hands-on practice. The case studies
and supplementary code are written in Python, utilizing various deep learning
frameworks. In most cases, deep learning depends on high-performance C++ or
CUDA libraries to perform computation. In our experience, installation
can become extremely tedious, especially when getting started with deep
learning. We attempt to limit this difficulty by providing Docker [Mer14] images and
a GitHub repository for each case study. Docker is a simple yet powerful tool that
provides a virtual machine-like environment for high-level (Python) and low-level
(C++ and CUDA) libraries to operate in isolation from the operating system.


Directions for accessing and running the code are provided in the repository
(https://github.com/SpringerNLP).

The case studies are as follows:

Chapter 2: An introduction to machine learning classification on the Higgs Boson
Challenge. This introduces basic concepts of machine learning along with
elements of data science.

Chapter 3: Text clustering, topic modeling, and text classification on the
Reuters-21,578 dataset are used to show some fundamental approaches to NLP.

Chapter 4: The fundamentals of supervised and unsupervised deep learning are
introduced on spoken digit recognition using the FSDD dataset.

Chapter 5: Embedding methods are introduced with a focus on text-based
representations on the Open American National Corpus.

Chapter 6: Text classification is explored with a variety of convolutional neural
network-based methods on the Twitter airlines dataset.

Chapter 7: Various recurrent neural network architectures are compared for
neural machine translation to perform English-to-French translation.

Chapter 8: Traditional HMM-based speech recognition is explored on the Common
Voice dataset, using Kaldi and CMUSphinx.

Chapter 9: This chapter has two distinct case studies. In the first, neural machine
translation from Chap. 7 is extended, exploring various attention-based
architectures. In the second, memory-augmented networks are compared for
question-answering tasks based on the bAbI dataset.

Chapter 10: Understanding multitask learning with different architectures on
NLP tasks such as part-of-speech tagging, chunking, and named entity recognition
is the focus of the case study in this chapter.

Chapter 11: Transfer learning, and specifically domain adaptation, is performed
using various techniques on the Amazon Review dataset.

Chapter 12: Continuing the ASR case study from Chap. 8, end-to-end techniques
are applied to speech recognition, leveraging CTC and attention on the
Common Voice dataset.

Chapter 13: Two popular reinforcement learning algorithms are applied to the
task of abstractive text summarization using the Cornell Newsroom dataset.

Chapter 2
Basics of Machine Learning


2.1 Introduction 


The goal of this chapter is to review basic concepts in machine learning that are
applicable or relate to deep learning. As it is not possible to cover every aspect of
machine learning in this chapter, we refer readers who wish to get a more in-depth
overview to textbooks, such as Learning from Data [AMMIL12] and The Elements of
Statistical Learning [HTF09].

We begin by giving the basic learning framework for supervised machine learning
and the general learning process. We then discuss some core concepts of machine
learning theory, such as VC analysis and the bias-variance trade-off, and how
they relate to overfitting. We guide the reader through various model evaluation,
performance, and validation metrics. We discuss some basic linear classifiers, starting
with discriminative ones, such as linear regression, perceptrons, and logistic
regression. We then give the general principle of non-linear transformations and
highlight support vector machines and other non-linear classifiers. In the treatment of
these topics, we introduce core concepts, such as regularization, gradient descent,
and more, and discuss their impact on effective training in machine learning.
Generative classifiers, such as naive Bayes and linear discriminant analysis, are introduced
next. We then demonstrate how basic non-linearity can be achieved through linear
algorithms via transformations. We highlight common feature transformations, such
as feature selection and dimensionality reduction techniques. Finally, we introduce
the reader to the world of sequence modeling through Markov chains. We provide
the necessary details on two very effective methods of sequence modeling: hidden
Markov models and conditional random fields.

We conclude the chapter with a detailed case study of supervised machine learning
using a real-world problem and dataset to carry out a systematic, evidence-based
machine learning process that puts into practice the concepts covered in this
chapter.

2.2 Supervised Learning: Framework and Formal Definitions


As discussed in Chap. 1, supervised machine learning is the task of learning from 
answers (labels, or the ground truth) provided by an oracle in a generalized manner. 
A simple example would be learning to distinguish apples from oranges. The pro- 
cess of supervised learning is shown schematically in Fig. 2.1, and we will refer to 
it for most of the chapters in this book. Let us now describe each of the components 
of the supervised learning process. 


2.2.1 Input Space and Samples 


The population of all possible data for a particular learning problem (e.g.,
discriminating apples from oranges) is represented by an arbitrary set $\mathcal{X}$. Samples can be
drawn independently from the population $\mathcal{X}$ with a probability distribution $P(X)$,
which is unknown. They can be represented formally as:

$$X = \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \tag{2.1}$$

Note that $X \subset \mathcal{X}$. An individual data point in the set $X$ drawn from the input
space $\mathcal{X}$, also referred to as an instance or an example, is normally represented in
vector form as $\mathbf{x}_i$ of $d$ dimensions. The elements of a vector $\mathbf{x}_i$ are also referred to as
features or attributes. For example, apples and oranges can be defined in terms of
{shape, size, color}, i.e., using $d = 3$ features/attributes. The features can be
categorical or nominal, such as color = {red, green, orange, yellow}. Alternatively, they
can be ordinal. In the latter case, the features can be discrete (taking on a finite
number of values) or continuous; e.g., each feature $i \in [d]$ can be a scalar in $\mathbb{R}$, yielding
a feature space of $\mathbb{R}^d$:

$$\mathbf{x}_i = x_{i1}, x_{i2}, \ldots, x_{id} \tag{2.2}$$
Fig. 2.1: The schematic summarizes the supervised learning process

This set of features can also be seen as $d$-dimensional vectors $\mathbf{f} = f_1, f_2, \ldots, f_d$,
which is useful in various feature transformation and selection processes.
The whole input data and corresponding labels can be viewed in matrix form as:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \tag{2.3}$$

In the above representation, row $i$ of the matrix $X$ stores sample $\mathbf{x}_i$, and its label
$y_i$ can be found in row $i$ of the matrix $Y$.

Alternatively, we can linearly represent the input data and corresponding labels
as a labeled dataset $D_{labeled}$:

$$D_{labeled} = (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n) \tag{2.4}$$
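
The two views of the data described above (the matrix form and the list of labeled pairs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fruit measurements, the numeric color codes, and the names `X`, `Y`, `D_labeled`, and `feature_columns` are hypothetical and not taken from the book's case studies.

```python
import numpy as np

# Hypothetical toy data: each row is one fruit described by d = 3 features
# (shape score, size in cm, color code). The values are made up.
X = np.array([
    [0.9, 7.5, 0.0],   # apple
    [0.8, 7.0, 1.0],   # apple
    [1.0, 8.2, 2.0],   # orange
    [1.0, 8.0, 2.0],   # orange
])

# Labels: 0 = apple, 1 = orange. Entry i of Y is the label of row i of X.
Y = np.array([0, 0, 1, 1])

n, d = X.shape  # n samples, d features

# The same data as a sequence of (x_i, y_i) pairs, in the spirit of Eq. 2.4.
D_labeled = list(zip(X, Y))

# Column j of X collects the values of feature j across all samples,
# which is the view used by feature selection and transformation methods.
feature_columns = [X[:, j] for j in range(d)]
```

The row view is the natural one for training and prediction, while the column view is convenient when individual features are scored, scaled, or dropped.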


2.2.2 Target Function and Labels 


Besides the probability distribution $P(X)$, another unknown entity is the target
function, or the ideal function, that maps the input space $\mathcal{X}$ to the output space $\mathcal{Y}$. This
function is formally represented as $f : \mathcal{X} \to \mathcal{Y}$. The objective in machine learning
is to find an approximation function close to the unknown target function $f$.

The output space $\mathcal{Y}$ represents all possible values that the target function $f$ can
map inputs to. Generally, when the values are categorical, finding an approximation
to $f$ is known as the classification problem, and when the values are continuous,
the problem is known as regression. When the output can only take two values, the
problem is known as binary classification (e.g., apples and oranges). In regression,
$y_i \in \mathbb{R}$.

Sometimes it is advantageous to think not in terms of an exact mapping or a
deterministic output for an instance $(\mathbf{x}, y)$, but instead in terms of a target joint
probability distribution $P(\mathbf{x}, y) = P(\mathbf{x})P(y|\mathbf{x})$. The latter better accommodates real-world
data that contain noise or stochasticity, and we shall have more to say on this later.


2.2.3 Training and Prediction 


The entire learning process can now be defined in terms of finding the approximate
function $h(\mathbf{x})$ from a large hypothesis space $\mathcal{H}$, such that $h(\mathbf{x})$ can effectively fit the
given data points $D_{labeled}$ in such a way that $h(\mathbf{x}) \approx f(\mathbf{x})$. The measure of success is
normally quantified by error (alternatively referred to as empirical risk, loss, or cost
function), which measures the discrepancy between $h(\mathbf{x})$ and the unknown target
function $f(\mathbf{x})$. Specifically,

$$\text{Error} = E(h(\mathbf{x}), f(\mathbf{x})) \approx e(h(\mathbf{x}), y) \tag{2.5}$$

where $E(h(\mathbf{x}), f(\mathbf{x}))$ is the real error with respect to the target function $f(\mathbf{x})$, which is
unknown and is therefore approximated by the error $e(h(\mathbf{x}), y)$ obtained from the data
and labels.

In the classification domain, the single-point error on a datum $(\mathbf{x}, y)$ can be binary
valued (recording a mismatch) and formally written as:

$$E(\mathbf{x}, y) = [\![ h(\mathbf{x}) \neq y ]\!] \tag{2.6}$$

The $[\![ \cdot ]\!]$ represents a function that yields 1 when the values are not equal and 0
otherwise. In the regression domain, the single-point error on a datum $(\mathbf{x}, y)$ can be
the squared error:

$$E(\mathbf{x}, y) = (h(\mathbf{x}) - y)^2 \tag{2.7}$$

The training error over the entire labeled dataset can be measured using the mean
over individual (single-point) errors for classification and regression, as:

$$E_{labeled}(h) = \frac{1}{N} \sum_{n=1}^{N} [\![ h(\mathbf{x}_n) \neq y_n ]\!] \tag{2.8}$$

$$E_{labeled}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(\mathbf{x}_n) - y_n)^2 \tag{2.9}$$

The prediction or out-of-sample error can be computed using the expected value
on the unseen datum $(\mathbf{x}, y)$:

$$E_{out}(h) = \mathbb{E}_{\mathbf{x}}[e(h(\mathbf{x}), y)] \tag{2.10}$$
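
As a concrete illustration of the two dataset-level errors in Eqs. 2.8 and 2.9, the following minimal sketch computes the mean 0-1 error and the mean squared error for a given hypothesis. The function names, the toy data, and the two hand-written hypotheses are assumptions made for this example only.

```python
import numpy as np

def classification_error(h, X, y):
    """Mean 0-1 error (Eq. 2.8): fraction of samples where h(x) != y."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != y))

def squared_error(h, X, y):
    """Mean squared error (Eq. 2.9)."""
    predictions = np.array([h(x) for x in X], dtype=float)
    return float(np.mean((predictions - y) ** 2))

# Hypothetical one-dimensional data with classification and regression labels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_class = np.array([0, 0, 1, 1])
y_reg = np.array([0.0, 1.0, 2.0, 3.5])

# Two hand-written hypotheses, chosen only for illustration.
h_class = lambda x: int(x[0] > 0.5)  # thresholds the single feature
h_reg = lambda x: x[0]               # predicts the feature value itself

err_class = classification_error(h_class, X, y_class)  # one mismatch in four
err_reg = squared_error(h_reg, X, y_reg)                # only the last point is off
```

Both quantities are computed on labeled data, so they estimate the out-of-sample error of Eq. 2.10 only when the data were not used to choose the hypothesis.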


2.3 The Learning Process 


Machine learning is the process that seeks to answer three questions: 


1. How to train model parameters from labeled data? 
2. How to select hyperparameters for a model given labeled data? 
3. How to estimate the out-of-sample error from labeled data? 


This process is generally done by logically dividing the entire labeled
dataset $D_{labeled}$ into three components: (a) the training set $D_{train}$, (b) the
validation set $D_{val}$, and (c) the test set $D_{test}$, as shown in Fig. 2.2. The training
set, $D_{train}$, is used to train a given model or hypothesis and learn the model
parameters that minimize the training error $E_{train}$. The validation set, $D_{val}$,
is used to select the best hyperparameters or models that minimize the validation
error $E_{val}$, which serves as a proxy to the out-of-sample error. Finally, the test
set, $D_{test}$, is used to estimate the unbiased error of the model trained with
the best hyperparameters selected over $D_{val}$ and with the parameters learned over
$D_{train}$. The unbiased error gives a good estimate of the error on unseen future data.


Fig. 2.2: The labeled dataset $D_{labeled}$ is split into the training $D_{train}$, validation
$D_{val}$, and testing $D_{test}$ datasets
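
The three-way split can be sketched as follows. This is a minimal illustration, assuming a random shuffle and 60/20/20 proportions; the function name `split_dataset`, its parameters, and the placeholder data are hypothetical, and other proportions or stratified splits are equally valid choices.

```python
import numpy as np

def split_dataset(X, y, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the labeled data and split it into train, validation, and test parts."""
    rng = np.random.default_rng(seed)
    n = len(X)
    order = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    n_val = int(round(val_fraction * n))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

# Hypothetical data: 50 samples with 2 features each and binary labels.
X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50) % 2

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_dataset(X, y)
```

The test part should be set aside before any model selection takes place; once it influences a modeling decision, the error measured on it is no longer unbiased.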


2.4 Machine Learning Theory 


In this section we will review basic theory associated with machine learning to ad- 
dress core issues in any learning scenario. 


2.4.1 Generalization-Approximation Trade-Off via the
Vapnik-Chervonenkis Analysis


The process of fitting a hypothesis function or a model to the labeled dataset can lead
to a core problem in machine learning that is known as overfitting. The issue here is
that we have used all the labeled data points to reduce the error, but this leads to poor
generalization. Let us consider a simple, one-dimensional dataset generated using
sin(x) as the target function with Gaussian noise added. We can illustrate the issue
of overfitting using a hypothesis set of polynomial functions. The different degrees
can be treated as parameters via which one can obtain different hypothesis functions
with which to fit the labeled data. Figure 2.3 shows how, by changing the degree
parameter, we can reduce the training error significantly by effectively increasing
the complexity of the model, and how the choice of the hypothesis
function can result in underfitting or overfitting. The hypothesis function that is
a polynomial of degree 1 poorly approximates the target function due to its lack of
complexity. In contrast, the overly complex hypothesis function of degree 15 (also
in Fig. 2.3) has modeled even the noise in the training data, resulting in overfitting.
Finding the right hypothesis that matches the given resources (training data), in such
a way that approximation and generalization are balanced,
is the holy grail of machine learning.

Fig. 2.3: Illustration of underfitting and overfitting in fitting a target function with
Gaussian noise with different-degree polynomials
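
The experiment behind this kind of figure can be approximated with a short sketch. The sample sizes, noise level, random seed, and the intermediate degree 3 are assumptions chosen for illustration and do not reproduce the book's figure; NumPy's `Polynomial.fit` is used for the least-squares fits.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n, noise_std=0.2):
    """Draw n points from sin(x) on [0, 2*pi] with additive Gaussian noise."""
    x = np.sort(rng.uniform(0.0, 2.0 * np.pi, size=n))
    y = np.sin(x) + rng.normal(0.0, noise_std, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)   # stands in for unseen data

def mse(model, x, y):
    return float(np.mean((model(x) - y) ** 2))

# Fit polynomials of increasing degree and record (training, test) errors.
results = {}
for degree in (1, 3, 15):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

Because the polynomial families are nested, the training error can only go down as the degree grows. The test error typically tells a different story for the highest degree, which is the overfitting behavior described above.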


Thus, there are two distinct errors to consider: (a) the in-sample training error,
given by $E_{train}(h)$, that measures the approximation aspect of the trade-off, and (b)
the out-of-sample error, given by $E_{out}(h)$, that has to be estimated and measures
the generalization aspect of the trade-off.

The probably approximately correct (PAC) learnability theorem provides the
following theoretical bound between the two errors in terms of the probability of the
model being approximately correct:

$$P\left[\,|E_{train}(h) - E_{out}(h)| > \epsilon\,\right] \le 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^2 N} \tag{2.11}$$

The equation bounds the probability that the absolute difference between the
two errors $E_{train}(h)$ and $E_{out}(h)$ is larger than $\epsilon$ in terms of something known as the growth
function $m_{\mathcal{H}}$ and the number of training data samples $N$. It has been shown that even
with an infinite hypothesis space of the learning algorithm (such as a perceptron, as
we will discuss later), the growth function is finite. In particular, the growth function
has a tight upper bound measured in terms of the Vapnik-Chervonenkis (VC)
dimension $d_{VC}$, which is the largest $N$ that can be shattered; i.e., for which $m_{\mathcal{H}}(N) = 2^N$. This
makes $m_{\mathcal{H}}(N)$ polynomial in the number of data points [Vap95]. Thus,

$$m_{\mathcal{H}}(N) \le N^{d_{VC}} + 1 \tag{2.12}$$

$$m_{\mathcal{H}}(N) \sim N^{d_{VC}} \tag{2.13}$$

Thus, Eq. 2.11 can be rewritten as:

$$E_{out}(h) \le E_{train}(h) + O\left(\sqrt{\frac{d_{VC}\log(N)}{N}}\right) \tag{2.14}$$

The VC dimension, given by $d_{VC}$, correlates with the model complexity, and the
above equation can be further rewritten as:

$$E_{out}(h) \le E_{train}(h) + \underbrace{\Omega(d_{VC})}_{\text{penalty}} \tag{2.15}$$

Figure 2.4 captures the relationship between $E_{out}(h)$ and $E_{train}(h)$ in the above
equations. When the model complexity is below an optimum threshold $d^{*}_{VC}$, both
the training error and the out-of-sample error are decreasing. Choosing any model
to represent the data below this optimal threshold will lead to underfitting. When the
model complexity is above the threshold, the training error $E_{train}(h)$ still decreases,
but the out-of-sample error $E_{out}(h)$ increases, and choosing any model with that
complexity will lead to overfitting.


The PAC analysis in terms of the VC dimension gives an upper bound of
the out-of-sample error given the training set and is independent of both the
target function $f : \mathcal{X} \to \mathcal{Y}$ and the probability distribution according to which
samples are drawn from the population. Recall that both the target function
and the probability distribution are unknown.
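
To get a feel for how the VC bound behaves, the following sketch evaluates the right-hand side of Eq. 2.11 with the growth function replaced by its polynomial approximation from Eq. 2.13, together with the order of the penalty term in Eq. 2.14. The function names and the chosen values of $N$, $d_{VC}$, and $\epsilon$ are assumptions for illustration, and constants hidden by the $O(\cdot)$ notation are ignored.

```python
import math

def vc_bound(n_samples, d_vc, epsilon):
    """Right-hand side of Eq. 2.11 with m_H(N) approximated by N**d_vc (Eq. 2.13).

    The bound is evaluated in log space to avoid overflow for large N.
    """
    log_bound = (math.log(4.0)
                 + d_vc * math.log(2.0 * n_samples)
                 - (epsilon ** 2) * n_samples / 8.0)
    # A probability bound above 1 carries no information, so cap it at 1.
    return 1.0 if log_bound >= 0 else math.exp(log_bound)

def vc_penalty(n_samples, d_vc):
    """Order of the generalization penalty in Eq. 2.14, ignoring constants."""
    return math.sqrt(d_vc * math.log(n_samples) / n_samples)

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, vc_bound(n, d_vc=3, epsilon=0.1), vc_penalty(n, d_vc=3))
```

The bound is vacuous for small $N$ and then decays quickly once the exponential term dominates the polynomial growth function, which is why the analysis is informative mainly about how the required sample size scales with $d_{VC}$.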


Fig. 2.4: Model complexity and learning curves. (a) Relationship between training
error, out-of-sample error, and model complexity. (b) Learning curves for relationship
between training error, out-of-sample error, and number of data points

2.4.2 Generalization-Approximation Trade-off via the
Bias-Variance Analysis


The bias-variance analysis is another way of measuring or quantifying the
generalization-approximation trade-off. The analysis is generally done using
regression with mean squared error as the success measure, but it can be modified for
classification [HTF09]. The equation for the bias-variance trade-off is given by:

$$\mathbb{E}_{\mathbf{x}}\left[(y - h(\mathbf{x}))^2\right] = \underbrace{\mathbb{E}_{\mathbf{x}}\left[(h(\mathbf{x}) - \bar{h}(\mathbf{x}))^2\right]}_{\text{Variance}} + \underbrace{\mathbb{E}_{\mathbf{x}}\left[(f(\mathbf{x}) - \bar{h}(\mathbf{x}))^2\right]}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[(y - f(\mathbf{x}))^2\right]}_{\text{Noise}} \tag{2.16}$$

The idea of the bias-variance trade-off is to decompose the out-of-sample
regression error $(y - h(\mathbf{x}))^2$ in terms of three quantities:

Variance: The term $(h(\mathbf{x}) - \bar{h}(\mathbf{x}))^2$ corresponds to the variance of $h(\mathbf{x})$ and is
caused by having too many hypotheses in the set $\mathcal{H}$. The term $\bar{h}(\mathbf{x})$ corresponds
to the average hypothesis from the entire set $\mathcal{H}$.

Bias: The term $(f(\mathbf{x}) - \bar{h}(\mathbf{x}))^2$ corresponds to the systematic error caused by
not having a good or sufficiently complex hypothesis to approximate the target
function $f(\mathbf{x})$.

Noise: The term $(y - f(\mathbf{x}))^2$ corresponds to the inherent noise present in the data.


In general, a simple model suffers from a large bias, whereas a complex model
suffers from a large variance. To illustrate the bias-variance trade-off, we again use
a one-dimensional sin(x) as a target function with added Gaussian noise to generate
data points. We fit polynomial regression with various degrees, as shown in Fig. 2.5.
The bias-variance trade-off is clearly evident in Fig. 2.5 and Table 2.1, which lists
the variance, bias, and noise in each case. A simple degree 1 model has a higher
bias error, contributing towards underfitting, while a complex degree 12 model shows
a huge variance error, contributing towards overfitting.

Table 2.1: Bias, variance, and noise errors for polynomials of degree 1 and 12

Hypothesis | Bias error | Variance error | Noise error | Total error
Degree 1   | 0.1870     | 0.0089         | 0.0098      | 0.2062
Degree 12  | 0.0453     | 2.4698         | 0.0098      | 2.5249
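
The decomposition can be estimated by simulation: draw many training sets, fit the same model class to each, and compare the fitted hypotheses with the target. The sketch below does this for polynomials of degree 1 and 12 on noisy sin(x) data. The fixed training grid, noise level, number of repetitions, and seed are assumptions for illustration, so the resulting numbers are not expected to match Table 2.1.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
noise_std = 0.1
x_train = np.linspace(0.0, 2.0 * np.pi, 30)   # fixed inputs; only the noise is redrawn
x_eval = np.linspace(0.0, 2.0 * np.pi, 200)   # points where hypotheses are compared
f_eval = np.sin(x_eval)                       # target function f(x)

def bias_variance(degree, n_datasets=300):
    """Estimate squared bias and variance of polynomial fits of a given degree."""
    predictions = np.empty((n_datasets, x_eval.size))
    for k in range(n_datasets):
        y_train = np.sin(x_train) + rng.normal(0.0, noise_std, size=x_train.size)
        model = Polynomial.fit(x_train, y_train, deg=degree)
        predictions[k] = model(x_eval)
    h_bar = predictions.mean(axis=0)                     # average hypothesis
    bias_sq = float(np.mean((f_eval - h_bar) ** 2))      # (f - h_bar)^2 averaged over x
    variance = float(np.mean(predictions.var(axis=0)))   # (h - h_bar)^2 averaged over x
    return bias_sq, variance

bias_1, var_1 = bias_variance(1)
bias_12, var_12 = bias_variance(12)
print(f"degree  1: bias^2 = {bias_1:.4f}, variance = {var_1:.4f}")
print(f"degree 12: bias^2 = {bias_12:.4f}, variance = {var_12:.4f}")
```

The noise term does not depend on the model; under these assumptions it equals `noise_std ** 2` for every degree.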


2.4.3 Model Performance and Evaluation Metrics 


Above, we have evaluated the performance of an algorithm or model using classifi- 
cation or regression error as a metric of success. In general, there are many metrics 
defined for supervised learning (in both the classification and regression domains) 
that depend on the size of the data, the distribution of the labels, problem mapping, 
and more. Below, we describe a few. 


2.4.3.1 Classification Evaluation Metrics 


We will consider the simple case of binary classification. In the classification
domain, the simplest visualization of the success of a model is normally described
using the confusion matrix, shown in Fig. 2.6. Accuracy and classification error are
informative measures of success when the data is balanced in terms of the classes;
that is, the classes have similar sizes. When the data is imbalanced, i.e., one class is
represented in larger proportion over the other class in the dataset, these measures
become biased towards the majority class and give a wrong estimate of success.
In such cases, base measures, such as true positive rate (TPR), false positive rate
(FPR), true negative rate (TNR), and false negative rate (FNR), become useful. In
addition, metrics such as F1 score and Matthews correlation coefficient (MCC)
combine the base measures to give an overall measure of success. Definitions are listed
below.

Fig. 2.6: Confusion matrix for binary classes


1. True positive rate (TPR) or recall or hit rate or sensitivity

$$\text{TPR} = \frac{TP}{TP + FN} \tag{2.17}$$

2. Precision or positive predictive value

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2.18}$$

3. Specificity

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{2.19}$$

4. Negative predictive value

$$\text{NPV} = \frac{TN}{TN + FN} \tag{2.20}$$

5. Miss rate or false negative rate

$$\text{FNR} = \frac{FN}{TP + FN} \tag{2.21}$$

6. Accuracy

$$\text{Accuracy} = \frac{TN + TP}{TP + FN + FP + TN} \tag{2.22}$$

7. F1 score

$$F1 = 2\,\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2.23}$$

8. Matthews correlation coefficient (MCC)

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}} \tag{2.24}$$
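
The definitions above translate directly into code. The following minimal sketch computes all eight measures from the four confusion-matrix counts. The function name and the example counts are hypothetical, and the sketch assumes every denominator is non-zero (a production implementation would need to handle empty classes).

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the measures of Eqs. 2.17-2.24 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "tpr": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "fnr": fn / (tp + fn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical counts for a binary classifier evaluated on 100 samples.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that TPR and FNR share a denominator and sum to 1, so reporting both is redundant; the same holds for specificity and the false positive rate.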


Many classification models provide not only a prediction of the class, but also 
a confidence value between 0 and 1 for each data point. The confidence thresh- 
old can control the performance of the classifier in terms of TPR and FPR. The 
curve that plots TPR and FPR for a classifier at various thresholds is known as the 
receiver-operating characteristic (ROC) curve. Similarly, precision and recall can 
be plotted at different thresholds, giving the precision-recall curve (PRC). The ar- 
eas under each curve are then respectively known as auROC and auPRC and are 
popular metrics of performance. In particular, auPRC is generally considered to be 
an informative metric in the presence of imbalanced classes. 
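As a rough sketch of how the ROC curve and auROC can be obtained from confidence values, the snippet below sweeps the threshold over the sorted scores. The helper names `roc_points` and `area_under_curve` are illustrative, labels are assumed to be 1 (positive) and 0 (negative), and tied scores are not treated specially.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by lowering the threshold one point at a time."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)              # highest confidence first
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted == 1)            # true positives accepted so far
    fp = np.cumsum(y_sorted == 0)            # false positives accepted so far
    tpr = np.concatenate(([0.0], tp / tp[-1]))
    fpr = np.concatenate(([0.0], fp / fp[-1]))
    return fpr, tpr


def area_under_curve(x, y):
    """Trapezoidal rule for the area under a piecewise-linear curve."""
    return float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))


# Hypothetical labels and confidence values.
fpr, tpr = roc_points([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
au_roc = area_under_curve(fpr, tpr)
```

The same sweep, recording precision and recall instead of TPR and FPR, gives the precision-recall curve.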


2.4.3.2 Regression Evaluation Metrics 


In the regression domain, where the predicted output is a real number that is com- 
pared with the actual value (another real number), many variants of squared errors 
are employed as evaluation metrics. We list a few below. 


1. Average prediction error is given by:

$$\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)}{n} \qquad (2.25)$$

where $y_i$ corresponds to the actual real-valued label and $\hat{y}_i$ is the predicted value
from the model.

2. Mean absolute error (MAE) treats the positive and negative errors in equal
measure and is given by:

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} \qquad (2.26)$$

3. Root mean squared error (RMSE) gives importance to large errors and is
given by:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \qquad (2.27)$$

4. Relative squared error (RSE) is used when two errors are measured in different
units:

$$\mathrm{RSE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (\bar{y} - y_i)^2} \qquad (2.28)$$

5. Coefficient of determination ($R^2$) summarizes the explanatory power of the
regression model and is given in terms of squared errors:

$$SSE_{residual} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.29)$$

$$SSE_{total} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (2.30)$$

$$R^2 = 1 - \frac{SSE_{residual}}{SSE_{total}} \qquad (2.31)$$
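The regression metrics listed above are straightforward to compute with NumPy. The following is a small sketch under the definitions given in this section; the function name `regression_metrics` and the toy values are illustrative assumptions.

```python
import numpy as np


def regression_metrics(y, y_hat):
    """Average error, MAE, RMSE, RSE, and R^2 for actual values y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    err = y - y_hat
    sse_residual = np.sum(err ** 2)
    sse_total = np.sum((y - y.mean()) ** 2)
    return {
        "avg_error": float(err.mean()),             # signed errors can cancel out
        "mae": float(np.abs(err).mean()),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "rse": float(sse_residual / sse_total),
        "r2": float(1.0 - sse_residual / sse_total),
    }


# Hypothetical actual and predicted values.
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
```

Here the signed errors cancel, so the average prediction error is zero although every prediction is off by 0.5; MAE and RMSE expose this.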


2.4.4 Model Validation 


Validation techniques are meant to answer the question of how to select a model(s)
with the right hyperparameter values. When there are many hypotheses in the
hypothesis set, then each unique hypothesis is trained on the training set $D_{train}$ and
then evaluated on the validation set $D_{val}$; the model(s) with the best performance
metrics is then chosen. Logically, the model $h^-$ (superscript denotes the model trained
on less data) is trained on a training dataset which has fewer points M as compared to
the entire set, as some data, K in all, are in the validation set. The performance on
the validation set is then given as:

$$E_{val}(h^-) = \frac{1}{K} \sum_{n=1}^{K} e(h^-(x_n), y_n) \qquad (2.32)$$

Using the VC bound related above, we can show that:

$$E_{out}(h^-) \le E_{val}(h^-) + O\left(\frac{1}{\sqrt{K}}\right) \qquad (2.33)$$


This equation shows that the larger the number of validation data points 
K, the better the estimation of the out-of-sample error is. However, from the 
learning curves, we now know that the more the training data, the smaller 
the training error. Thus, by removing K points from the budget of training, 
we have theoretically increased the chances of having a larger training error. 
This gives rise to an interesting learning paradox: we need a large number of
validation points to have a good estimate of the out-of-sample error, while at 
the same time, for a model to be trained better, we need fewer data points in 
the validation set. 

A way of addressing this paradox in practice is by training the model
using only the training data $D_{train}$, using the model $h^-$ on the validation
data $D_{val}$ to estimate the error $E_{val}(h^-)$, and then adding the data back to the
training set to learn a new model $h$ on $D_{train} \cup D_{val}$. This is known as the
validation process and is illustrated in Fig. 2.7. Putting it all together, this
allows us to obtain the following upper bound on $E_{out}(h)$:

$$E_{out}(h) \le E_{out}(h^-) \le E_{val}(h^-) + O\left(\frac{1}{\sqrt{K}}\right) \qquad (2.34)$$


An important point to note is that when the validation set has been used for 
model performance evaluation, the estimates of out-of-sample errors derived 
from the validation errors are optimistic; that is, the validation set is now a 
biased set, because we use it indirectly to learn the hyperparameters of the 
models. The validation process is a simple method that can be used for model 
selection and is independent of the model or learning algorithm. 


The only drawback with the validation process is the need to have a large number
of labeled data points for creating the training set and the validation set. It is
normally difficult to collect a large labeled set due to the cost or difficulty in
obtaining the labels. In such cases, instead of physically separating the training set
and validation set, a technique known as k-fold cross-validation is used. The k-fold
cross-validation algorithm is shown in Fig. 2.8. First, the data is randomly divided
into k sets (folds). Then, in each of the resulting k experiments, k − 1 data folds are
used for training and one fold is used for validation to measure $E_{val}$ for that fold.
Finally, the average of the k validation errors is used as a single estimate of the
validation error $E_{val}$.
Fig. 2.7: Model training and validation process


[Figure: ten experiments (Exp = 1 to Exp = 10) over the folds D1 to D10; in each experiment a different fold serves as D_val and the remaining nine folds form D_train, giving one validation error E_val per experiment.]

Fig. 2.8: Illustration of tenfold cross-validation


We highlight the validation process involving a hypothesis with a single parameter
$\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_M)$ with M finite values in search of the best parameter.

Algorithm 1: FindBestParameters


Data: D_train[k], D_val[k]
Result: bestParameter, lowestError
begin
    splitDataFolds: create k folds of the labeled dataset D
    bestParameter ← λ_0
    for m ∈ 1..M do
        λ ← λ_m
        for i ∈ 1..k do
            trainModel(h(λ), D_train[i])
            E_val[i] ← testModel(h(λ), D_val[i])
        E_val ← (1/k) Σ_{i=1..k} E_val[i]
        if current value is the best seen then
            lowestError ← E_val
            bestParameter ← λ_m
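A possible Python rendering of Algorithm 1 is sketched below. It is only an illustration: `find_best_parameter`, `train_fn`, and `error_fn` are names introduced here, the folds are formed by a random permutation, and the usage example with a ridge-regularized linear model on synthetic data is an assumption chosen to keep the sketch self-contained.

```python
import numpy as np


def find_best_parameter(X, y, candidates, train_fn, error_fn, k=5, seed=0):
    """Return (best_parameter, lowest_error) using k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # k disjoint index sets
    best_parameter, lowest_error = None, np.inf
    for lam in candidates:
        fold_errors = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train_fn(X[train_idx], y[train_idx], lam)
            fold_errors.append(error_fn(model, X[val_idx], y[val_idx]))
        e_val = float(np.mean(fold_errors))              # average over the k folds
        if e_val < lowest_error:
            best_parameter, lowest_error = lam, e_val
    return best_parameter, lowest_error


# Hypothetical usage: pick the regularization constant of a linear model.
def train_ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mean_squared_error(w, X, y):
    return float(np.mean((X @ w - y) ** 2))


demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * demo_rng.normal(size=60)
best_lam, best_err = find_best_parameter(
    X_demo, y_demo, [0.001, 1000.0], train_ridge, mean_squared_error
)
```

Passing the training and error functions as arguments keeps the search independent of the model, in the same spirit as the model-agnostic validation process described above.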


2.4.5 Model Estimation and Comparisons 


Once the hypothesis or the model with the best parameters is selected and trained
with the entire $D_{train} \cup D_{val}$ set, the test set $D_{test}$ is then used to estimate the test
error. Recall that the PAC equation is a function of the hypotheses set size M and
dataset size N. Therefore, when only one hypothesis is considered, M = 1, in the
presence of a sufficiently large test set, the PAC equation confirms that the test error
is a good approximation of the out-of-sample error. Since the test set has not been
used for either model training or model hyperparameter selection, the error remains
an unbiased estimate in contrast to the training or validation error.

If a comparison needs to be made between two or more classifiers on one or more 
datasets, to obtain statistical estimations of differences of metrics, various statistical 
hypothesis testing techniques can be employed. One has to be aware of the assump- 
tions and constraints of each technique before selecting one [JS11, Dem06, Die98]. 
Let us describe a few statistical hypothesis testing techniques. 

Resampled t-test is a parametric test (in that it assumes a distribution) that is
used to compare two classifiers on a metric such as accuracy or error via n different
random resamplings of training and test subsets from a single dataset. If $p_i$ is the
difference between the performance of the two classifiers on the i-th resampling and is
assumed to have a Gaussian distribution, and $\bar{p}$ is the average performance difference,
the t-statistic is given by:

$$t = \frac{\bar{p}\,\sqrt{n}}{\sqrt{\dfrac{\sum_{i=1}^{n} (p_i - \bar{p})^2}{n - 1}}} \qquad (2.35)$$

McNemar’s test is a popular non-parametric test used to compare two classifiers
on the same test set. The test uses the counts given in Table 2.2.

Table 2.2: Basic counts of errors and correct predictions for two classifiers on a test set

                       Classifier2 error    Classifier2 correct
Classifier1 error      n00                  n01
Classifier1 correct    n10                  n11

The null hypothesis assumes $n_{01} = n_{10}$, and the statistic is given by:

$$\frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}} \qquad (2.36)$$
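As an illustrative sketch, the statistic of Eq. 2.36 can be computed directly from the two classifiers' predictions on a shared test set. The function name `mcnemar_statistic` and the toy predictions are assumptions made here; returning 0.0 when the classifiers never disagree is a convention chosen for this sketch.

```python
def mcnemar_statistic(y_true, pred_1, pred_2):
    """McNemar statistic (Eq. 2.36) from the disagreement counts of Table 2.2."""
    n01 = 0  # classifier 1 wrong, classifier 2 correct
    n10 = 0  # classifier 1 correct, classifier 2 wrong
    for truth, p1, p2 in zip(y_true, pred_1, pred_2):
        ok_1, ok_2 = (p1 == truth), (p2 == truth)
        if not ok_1 and ok_2:
            n01 += 1
        elif ok_1 and not ok_2:
            n10 += 1
    if n01 + n10 == 0:
        return 0.0  # the classifiers never disagree on correctness
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)


# Hypothetical test set of ten positive examples.
truth = [1] * 10
clf_1 = [0] * 6 + [1] * 4   # six errors
clf_2 = [1] * 10            # no errors
stat = mcnemar_statistic(truth, clf_1, clf_2)
```

Under the null hypothesis the statistic is commonly compared against a chi-squared distribution with one degree of freedom.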


The Wilcoxon signed-ranks test is a non-parametric test that is popular when two
classifiers have to be compared on multiple datasets. The test ranks the differences
between metrics of the two classifiers under comparison on the i-th dataset of N
datasets as $d_i = p_{i1} - p_{i2}$, ignoring the signs. The statistic is given by

$$T = \min(R^+, R^-) \qquad (2.37)$$

where

$$R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$$

and

$$R^- = \sum_{d_i < 0} \mathrm{rank}(d_i) + \frac{1}{2} \sum_{d_i = 0} \mathrm{rank}(d_i)$$

The Friedman non-parametric test with Iman-Davenport extension is used when
there are multiple classifiers to be compared on multiple datasets, a scenario that is
very common when presenting a novel classifier or aiming to select a single best
classifier across many on a given dataset. If $R_i^j$ is the rank of the j-th classifier among
K classifiers in the i-th dataset among N datasets, and $R_j = \frac{1}{N} \sum_{i=1}^{N} R_i^j$ is its average
rank, then the statistic is given by:

$$F_F = \frac{(N - 1)\,\chi_F^2}{N(K - 1) - \chi_F^2} \qquad (2.38)$$

where

$$\chi_F^2 = \frac{12N}{K(K + 1)} \left[ \sum_{j=1}^{K} R_j^2 - \frac{K(K + 1)^2}{4} \right]$$







2.4.6 Practical Tips for Machine Learning 


Though not true in every scenario, there are many practical tips for machine 
learning practitioners. We list a few here. 
• In an unbalanced dataset, when one has enough data, it is a good idea to
create stratified samples of the training and test set. The general rule of
thumb is to have a number of samples that is at least ten times the number
of dimensions, if possible. Generally, 20% of the data is set aside for
testing.

• If the labeled dataset is very sparse (that is, the number of samples is much
smaller than the number of dimensions), instead of dividing the dataset into
training and test, one can use the cross-validation process for both model
selection and estimation. One has to be aware that the error metric will be
optimistic and may not reflect the real out-of-sample error.

• To obtain a good error estimate in an unbalanced dataset, the test set should
have a similar proportion of positives and negatives as the general population
estimate.

• The test set needs to have similar data characteristics as the general
population estimate. This includes feature statistics and distributions.

• In an unbalanced dataset, creating many samples of the same minority class
with different majority class samples is often useful for both training and testing.
The variance of the error estimate across various sets provides an important
metric of the sample bias.

• Use the training set for training and use cross-validation on the training set
for hyperparameter selection.

• The training sample size in an unbalanced dataset can be oversampled or
undersampled. The ratio of the two classes (in binary classification) is
another hyperparameter to learn.

• Always plot the learning curves on the validation set (cross-validation
average with variance on folds of the training set) to evaluate the number of
instances needed for a given algorithm.


2.5 Linear Algorithms 


Let us now consider similar issues in the context of regression. In this chapter, we 
will limit our discussion to linear regression. We will give basic introduction and 
equations around each algorithm followed by discussion points capturing advan- 
tages and limitations. 


2.5.1 Linear Regression 


Linear regression is one of the simplest linear models that has been analyzed
theoretically and applied practically to many domains [KK62]. The dataset assumes the
labels to be numeric values or real numbers; for instance, one could be interested in
predicting the house price in a location where historical data of previous house sales
with important features such as structure, rooms, neighborhood, and other features
have been collected. Due to the popularity of linear regression, let us go over some
important elements, such as concepts, facets of optimization, and others.

The hypothesis $h$ is a linear combination of the input $x$ and the weight parameters
$w$ (that we intend to learn through training). For a d-dimensional input
($x = [x_1, x_2, \ldots, x_d]$), we introduce another dimension called the bias term, $x_0$, with
value 1. Thus the input can be seen as $x \in \{1\} \times \mathbb{R}^d$, and the weights to be learned are
$w \in \mathbb{R}^{d+1}$.

$$h(x) = \sum_{i=0}^{d} w_i x_i \qquad (2.39)$$


In matrix notation, the input can be represented as a data matrix $X \in \mathbb{R}^{N \times (d+1)}$,
whose rows are examples from the data (e.g., $x_n$), and the output is represented
as a column vector $y \in \mathbb{R}^N$. We will assume that the good practice of dividing the
datasets into training, validation, and testing is followed, and we will represent the
training error by $E_{train}$.

The process of learning via linear regression can be analytically represented as
minimizing the squared error between the hypothesis function $h(x_n)$ and the target
real values $y_n$, as:

$$E_{train}(h(x, w)) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2 \qquad (2.40)$$


Since the data x is given, we will write the equation in terms of weights w:

$$E_{train}(w) = \frac{1}{N} \|Xw - y\|^2 \qquad (2.41)$$

where $\|Xw - y\|$ is the Euclidean norm of a vector.

So, we can write

$$E_{train}(w) = \frac{1}{N} \left( w^T X^T X w - 2 w^T X^T y + y^T y \right) \qquad (2.42)$$

We need to minimize $E_{train}$. This is an optimization problem, as we need to find
the weights $w_{opt}$ that minimize the training error. That is, we need to find:

$$w_{opt} = \arg\min_{w \in \mathbb{R}^{d+1}} E_{train}(w) \qquad (2.43)$$


We can assume that the loss function $E_{train}(w)$ is differentiable. So, to obtain a
solution, we take the gradient of the loss function with respect to w, and set it to the
zero vector 0:

$$\nabla E_{train}(w) = \frac{2}{N} \left( X^T X w - X^T y \right) = 0 \qquad (2.44)$$

$$X^T X w = X^T y \qquad (2.45)$$

We also assume that $X^T X$ is invertible, and so we obtain

$$w_{opt} = (X^T X)^{-1} X^T y \qquad (2.46)$$

We can represent the pseudo-inverse as $X^{\dagger}$, such that $X^{\dagger} = (X^T X)^{-1} X^T$. This
derivation shows that linear regression has a direct analytic formula to compute
the optimum weights, and the learning process is as simple as computing the
pseudo-inverse matrix and the matrix multiplication with the label vector y. The following
algorithms implement the described optimization process.


Algorithm 2: LinearRegression train


Data: Training Dataset (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) such that x_i ∈ R^d and y_i ∈ R
Result: Weight vector w ∈ R^(d+1)
begin
    create a matrix X from inputs and a bias for each vector x_0 = 1
    create a vector y from labels
    compute the pseudo-inverse X† = (X^T X)^(-1) X^T
    w = X† y


Algorithm 3: LinearRegression predict


Data: Test Data x such that x ∈ R^d and weight vector w
Result: Prediction ŷ
begin
    create a vector x from inputs and prefix the input vector with the bias term x_0 = 1
    ŷ = x^T w
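Algorithms 2 and 3 can be sketched with NumPy as follows. This is a minimal illustration: the function names are hypothetical, and instead of forming the pseudo-inverse explicitly the sketch solves the normal equations (Eq. 2.45) with `np.linalg.solve`, which assumes that X^T X is invertible.

```python
import numpy as np


def add_bias(X):
    """Prefix every input vector with the bias term x_0 = 1."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])


def linear_regression_train(X, y):
    """Solve X^T X w = X^T y for the weight vector w in R^(d+1)."""
    Xb = add_bias(X)
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ np.asarray(y, dtype=float))


def linear_regression_predict(X, w):
    """Return the predictions x^T w for each row of X."""
    return add_bias(X) @ w


# Hypothetical one-dimensional data generated by y = 1 + 2x.
w_lin = linear_regression_train([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
y_new = linear_regression_predict([[4.0]], w_lin)
```

When X^T X is singular or badly conditioned, `np.linalg.lstsq` or `np.linalg.pinv` would be the more robust choice.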


2.5.1.1 Discussion Points 


• Linear regression has an efficient training algorithm, with time complexity
polynomial in the size of the training data.

• Linear regression assumes $X^T X$ is invertible. Even if this is not the case, the
pseudo-inverse can be employed, though doing so does not guarantee a unique
optimal solution. We note that there are techniques that can compute the
pseudo-inverse without inverting the matrix.

• The performance of linear regression is affected if there is correlation among the
features in the training set.
2.5.2 Perceptron


Perceptrons are models based on the linear regression hypothesis composed with the 
sign function that provides a classification output instead of regression, as shown 
below, and illustrated in Fig. 2.9. 


$$h(x) = \mathrm{sign}\left( \sum_{i=0}^{d} w_i x_i \right) \qquad (2.47)$$

Fig. 2.9: Perceptron


The training algorithm for perceptrons is to initialize the weights and iterate over
the training set, changing the weights only when data points are wrongly classified
[Ros58]. This iterative process converges only when the dataset is linearly separable.
For datasets that are nearly but not exactly linearly separable (having a small number
of points on the wrong side of the plane), a small modification is made: iterate only up
to a maximum number of iterations and store the weights corresponding to the lowest
loss seen so far. This is known as the pocket algorithm. The perceptron training
algorithm tries to find a hyperplane of d − 1 dimensions in a d-dimensional dataset.


2.5.2.1 Discussion Points 


• Perceptrons need not find the best hyperplane (maximum separation between
points) separating the two classes and suffer from noise in the datasets.

• Outliers impact the algorithm’s ability to find the best hyperplane and so
addressing them is important.
Algorithm 4: Perceptron


Data: Training Dataset (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) such that x_i ∈ R^d and y_i ∈ {+1, −1},
      MaxIterations = T
Result: Weight vector w ∈ R^(d+1)
begin
    create a vector x from inputs and prefix the input vector with the bias term x_0 = 1
    create a vector y from labels
    initialize weight vector w_0 to be 0
    bestWeight ← w_0
    initialize loss(w) to be 1
    for t ← 0..T−1 do
        for i ← 0..N−1 do
            if sign(w_t^T x_i) ≠ y_i then
                update the weight vector w_(t+1) = w_t + y_i x_i
                currentLoss(w) ← currentLoss(w_(t+1))
                if currentLoss(w) < loss(w) then
                    loss(w) ← currentLoss(w)
                    bestWeight ← w_(t+1)
    return bestWeight


Algorithm 5: Perceptron
Data: Test Data x such that x ∈ R^d and weight vector w
Result: Prediction ŷ ∈ {+1, −1}
begin
    create a vector x from inputs and add a bias for each vector x_0 = 1
    ŷ = sign(x^T w)
    return ŷ
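A compact sketch of the perceptron with the pocket modification is given below. It is an illustration under a few assumptions: the function names are hypothetical, the loss is taken to be the fraction of misclassified training points, and the loss is evaluated once per pass over the data rather than after every single update.

```python
import numpy as np


def perceptron_train(X, y, max_iterations=100):
    """Pocket perceptron: keep the weights with the lowest training error seen."""
    Xb = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])  # bias x_0 = 1
    y = np.asarray(y)
    w = np.zeros(Xb.shape[1])
    best_w, best_loss = w.copy(), 1.0
    for _ in range(max_iterations):
        for x_i, y_i in zip(Xb, y):
            if np.sign(w @ x_i) != y_i:        # misclassified (or on the boundary)
                w = w + y_i * x_i              # perceptron update
        loss = float(np.mean(np.sign(Xb @ w) != y))
        if loss < best_loss:                   # "pocket" the best weights so far
            best_w, best_loss = w.copy(), loss
        if best_loss == 0.0:
            break
    return best_w


def perceptron_predict(X, w):
    Xb = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
    return np.sign(Xb @ w)


# Hypothetical linearly separable data (logical AND with labels in {-1, +1}).
X_and = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_and = [-1, -1, -1, 1]
w_and = perceptron_train(X_and, y_and)
```

On linearly separable data such as this the loop stops as soon as a pass ends with zero training error; otherwise it runs for the maximum number of passes and returns the pocketed weights.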


2.5.3 Regularization 


As discussed earlier, one of the common issues in supervised machine learning is 
overfitting. Equation 2.15 can be seen as a penalty on the model complexity. If 
this penalty is taken into account in the training of the model, then learning is im- 
proved. The idea of regularization is to do just that; that is, to introduce this penalty 
in the training itself. Regularization can be generally considered as an application 
of Occam’s Razor in that the goal is to choose a simpler hypothesis. In general, 
regularization is used to combat the noise inherent in the dataset. 

In many weight-based machine learning algorithms, such as linear regression, 
perceptrons, logistic regression, and neural networks, it is common practice to put 
a penalty on the weights and introduce that in the loss function. The resulting aug- 
mented loss function that is then used for optimization is given below: 


$$E_{aug}(h) = E_{train}(h) + \lambda\,\Omega(w) \qquad (2.48)$$

In the augmented loss function above, the scalar parameter $\lambda$ is known as the
regularization constant, and $\Omega(w)$ is known as the regularization function.


2.5.3.1 Ridge Regularization: L2 Norm

One of the popular regularization functions is the L2 norm [HK00], also known
as weight decay or ridge regularization. It can be substituted in Eq. 2.48 as shown
below:

$$E_{aug}(h) = E_{train}(h) + \lambda\, w^T w \qquad (2.49)$$


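Therefore, one can search for the optimal $w_{opt}$, defined as:

$$w_{opt} = \arg\min_{w \in \mathbb{R}^{d+1}} E_{aug}(w) \qquad (2.50)$$

$$w_{opt} = \arg\min_{w \in \mathbb{R}^{d+1}} \left( E_{train}(h) + \lambda\, w^T w \right) \qquad (2.51)$$

The linear regression solution modified with regularization is:

$$w_{opt} = (X^T X + \lambda I)^{-1} X^T y \qquad (2.52)$$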


The regularization parameter $\lambda$ is normally selected using the validation
technique described above for any hyperparameter and is often a small value
(for example, around 0.001). The impact of L2 regularization is that some weights
which are less relevant will have values closer to zero. In this way, L2 regularization
can be seen as conducting implicit feature selection via feature weighting. L2
regularization is computationally efficient.
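The closed-form ridge solution of Eq. 2.52 can be sketched in NumPy as follows. The function name `ridge_train` and the toy data are assumptions for illustration; following the equation as written, the bias weight is penalized together with the other weights.

```python
import numpy as np


def ridge_train(X, y, lam=1e-3):
    """Return w = (X^T X + lam * I)^(-1) X^T y, with a bias column prefixed to X."""
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    identity = np.eye(Xb.shape[1])
    return np.linalg.solve(Xb.T @ Xb + lam * identity, Xb.T @ np.asarray(y, dtype=float))


# Hypothetical data generated by y = 1 + 2x.
X_r = [[0.0], [1.0], [2.0], [3.0]]
y_r = [1.0, 3.0, 5.0, 7.0]
w_small = ridge_train(X_r, y_r, lam=1e-3)   # close to the unregularized solution
w_large = ridge_train(X_r, y_r, lam=100.0)  # weights shrunk towards zero
```

Comparing the two weight vectors shows the shrinking effect of the penalty: the larger the constant, the smaller the norm of the learned weights.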


2.5.3.2 Lasso Regularization: L1 Norm


The L1 norm is another popular regularization used in weight-based algorithms
[HTF09]:

$$w_{opt} = \arg\min_{w} \left( E_{train}(h) + \lambda \|w\|_1 \right) \qquad (2.53)$$

Due to the absolute value function, the above equation does not have a closed-form
solution and is generally represented as a constrained optimization problem as below:

$$\arg\min_{w} \|Xw - y\|^2 \quad \text{s.t.} \quad \sum_{i} |w_i| \le \eta \qquad (2.54)$$

The parameter $\eta$ is inversely related to the regularization parameter $\lambda$. The problem
above can be shown to be convex, and quadratic programming is
generally used to obtain the optimized weights.


As in L2 regularization, the regularization parameter $\lambda$ in L1 regularization is
selected using validation techniques. In comparison with L2, L1 regularization
generally results in more feature weights being set to zero. Thus, L1 regularization
yields a sparse representation through implicit feature selection.


2.5.4 Logistic Regression 


Logistic regression can be seen as a transformation $\theta$ on the linear combination $w^T x$
that allows a classifier to return a probability score [WD67]:

$$h(x) = \theta(w^T x) \qquad (2.55)$$

A logistic function (also known as a sigmoid function; the softmax function is its
multi-class generalization) $\theta(w^T x)$, shown below, is generally used for the
transformation:

$$\theta(w^T x) = \frac{\exp(w^T x)}{1 + \exp(w^T x)} \qquad (2.56)$$


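For a binary classification, where $y \in \{-1, +1\}$, the hypothesis can be seen as
a likelihood of predicting y = +1, i.e., $P(y = +1 \mid x)$. Thus, the equation can be
rewritten as a log-odds ratio, and weights are learned to maximize the conditional
likelihood given the inputs:

$$\log \frac{P(y = +1 \mid x)}{1 - P(y = +1 \mid x)} = w^T x \qquad (2.57)$$

The log-likelihood of the hypothesis can be written as:

$$\log L(h(x)) = \sum_{i=1}^{n} \log P(y_i \mid x_i) \qquad (2.58)$$

$$\log P(y_i \mid x_i) = \begin{cases} \log h(x_i) & \text{if } y_i = +1 \\ \log(1 - h(x_i)) & \text{if } y_i = -1 \end{cases} \qquad (2.59)$$

With the labels recoded as $y_i \in \{0, 1\}$ (1 for the positive class), the two cases
combine into a single expression:

$$\log L(h(x)) = \sum_{i=1}^{n} \left( y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i)) \right) \qquad (2.60)$$
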
In information theory, if one treats $y_i$ and $h(x_i)$ as probability distributions, the
negative of the above expression is referred to as the cross-entropy error. This
cross-entropy error can be treated as our new error function $E_{train}$, but it cannot be
minimized in closed form. Instead of an analytical solution, an iterative algorithm known
as gradient descent can be employed. Gradient descent is a general optimization
algorithm that is used widely in machine learning, including deep learning. Let us
discuss it at some length below.


2.5.4.1 Gradient Descent 


Let us recall that the goal is to find weights w that minimize $E_{train}$, and that at the
minimum, the gradient of $E_{train}$ is 0. In gradient descent, the negative of the gradient
is followed in an iterative process until the gradient is zero. The gradient is a vector
containing partial derivatives with respect to each of the dimensions [Bry61], as shown
below:

$$g = \nabla E_{train}(w) = \left[ \frac{\partial E_{train}}{\partial w_0}, \frac{\partial E_{train}}{\partial w_1}, \ldots, \frac{\partial E_{train}}{\partial w_d} \right] \qquad (2.61)$$

The normalized gradient $\hat{g}$ can be written as:

$$\hat{g} = \frac{\nabla E_{train}(w)}{\|\nabla E_{train}(w)\|} \qquad (2.62)$$

A small step of size $\eta$ is made in the direction of $-\hat{g}$, and the weights are updated
accordingly, leading to an optimal point. Selecting a small step size is important;
otherwise the algorithm oscillates and does not reach the optimum point. The
algorithm can be summarized as:


Algorithm 6: Gradient descent


Data: Training Dataset D_train = (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) such that x_i ∈ R^d and
      y_i ∈ {+1, −1}, Loss Function E_train(w), Step size η and MaxIterations T
Result: Weight vector w ∈ R^(d+1)
begin
    w_0 ← init(w)
    for t ← 0..T−1 do
        g_t ← ∇E_train(w_t)
        w_(t+1) ← w_t − η g_t
    return w


The weights w can be initialized to the 0 vector or set to random values (each 
obtained from a normal distribution with 0 mean and small variance) or preset 
values. Another important decision in gradient descent is the termination cri- 
terion. The algorithm can be made to terminate when the number of iterations 
reaches a specific value or when the norm of the gradient falls below a predefined
threshold, close to zero.
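Algorithm 6, together with both termination criteria just described, can be sketched as follows. The function name `gradient_descent`, the default values, and the least-squares example (whose gradient is Eq. 2.44) are illustrative assumptions.

```python
import numpy as np


def gradient_descent(grad_fn, w0, step_size=0.1, max_iterations=5000, tol=1e-8):
    """Follow the negative gradient until the iteration budget or gradient threshold is hit."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iterations):
        g = grad_fn(w)
        if np.linalg.norm(g) < tol:      # gradient is close to zero: stop
            break
        w = w - step_size * g            # w_(t+1) = w_t - eta * g_t
    return w


# Hypothetical example: least squares on data generated by y = 1 + 2x.
X_gd = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # first column is the bias
y_gd = np.array([1.0, 3.0, 5.0, 7.0])


def least_squares_gradient(w):
    return (2.0 / len(y_gd)) * (X_gd.T @ X_gd @ w - X_gd.T @ y_gd)


w_gd = gradient_descent(least_squares_gradient, np.zeros(2))
```

With this data a step size of 0.1 converges to the same weights as the closed-form solution; a much larger step size would make the iterates oscillate or diverge, as noted above.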


2.5.4.2 Stochastic Gradient Descent 


One of the disadvantages of gradient descent is the use of the entire training dataset
when computing the gradient. This affects memory use and computation time, both of
which grow as the number and dimension of training examples increase. Stochastic
gradient descent is a version of gradient descent that, instead of utilizing the entire
training dataset, at each iteration picks a data point uniformly at random from the
training dataset (hence the name stochastic) and uses the gradient of the loss at that
point. It has been shown that with a large number of iterations and a small step size,
stochastic gradient descent generally reaches the same optimum as the batch gradient
descent algorithm [BB08].


Algorithm 7: Stochastic gradient descent


Data: Training Dataset D_train = (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) such that x_i ∈ R^d and
      y_i ∈ {+1, −1}, Loss Function E_train(w), Step size η and MaxIterations T
Result: Weight vector w ∈ R^(d+1)
begin
    w_0 ← init(w)
    for t ← 0..T−1 do
        d ← (x_i, y_i), a data point picked uniformly at random from D_train
        g_t ← ∇E_d(w_t)
        w_(t+1) ← w_t − η g_t
    return w


Figure 2.10 illustrates the iterative changes in the training error for (batch) gradi- 
ent descent and stochastic gradient descent for a one-dimensional linear regression 
problem. 

It can be shown that in logistic regression the gradient is:

$$\nabla E_{train}(w) = -\frac{1}{N} \sum_{i=1}^{N} \frac{y_i x_i}{1 + \exp(y_i w^T x_i)} \qquad (2.63)$$

The training of logistic regression using gradient descent is described by the 
following algorithm: 
Fig. 2.10: One-dimensional regression with gradient descent and stochastic gradient
descent


Algorithm 8: Logistic regression with gradient descent


Data: Training Dataset (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) such that x_i ∈ R^d and y_i ∈ {+1, −1},
      MaxIterations = T and Step size η
Result: Weight vector w ∈ R^(d+1)
begin
    create a vector x from inputs and add a bias for each vector x_0 = 1
    for t ← 0..T−1 do
        g_t ← −(1/N) Σ_{i=1..N} y_i x_i / (1 + exp(y_i w_t^T x_i))
        w_(t+1) ← w_t − η g_t
    return w
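The following is a minimal sketch of Algorithm 8 using the gradient of Eq. 2.63 with labels in {+1, −1}. The function names, step size, iteration count, and toy data are assumptions made for illustration.

```python
import numpy as np


def with_bias(X):
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])   # bias term x_0 = 1


def logistic_gradient(w, Xb, y):
    """Gradient of Eq. 2.63: -(1/N) * sum_i y_i x_i / (1 + exp(y_i w^T x_i))."""
    margins = y * (Xb @ w)
    coeff = y / (1.0 + np.exp(margins))
    return -(Xb * coeff[:, None]).mean(axis=0)


def logistic_train(X, y, step_size=0.5, max_iterations=2000):
    Xb = with_bias(X)
    y = np.asarray(y, dtype=float)
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iterations):
        w = w - step_size * logistic_gradient(w, Xb, y)
    return w


def logistic_predict_proba(X, w):
    """P(y = +1 | x) from the logistic function of Eq. 2.56."""
    return 1.0 / (1.0 + np.exp(-(with_bias(X) @ w)))


# Hypothetical one-dimensional data: negatives on the left, positives on the right.
X_lr = [[-2.0], [-1.0], [1.0], [2.0]]
y_lr = [-1, -1, 1, 1]
w_lr = logistic_train(X_lr, y_lr)
p_lr = logistic_predict_proba(X_lr, w_lr)
```

Thresholding the returned probabilities at 0.5 gives class predictions; other thresholds trade TPR against FPR as discussed in the evaluation section.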


2.5.5 Generative Classifiers 


All algorithms we have seen so far have been discriminative in their approach; that 
is, they make no assumption about the underlying distribution of the data and focus 
instead on the end goal of prediction accuracy. Another popular approach in machine 
learning is the generative approach, which assumes an underlying distribution with 
which the data is generated and tries to find parameters of this distribution in its 
training. 

The generative approach, though an indirect mechanism for achieving prediction
accuracy, has been very successful in real-world applications. Many machine
learning algorithms, both supervised and unsupervised, such as naive Bayes, linear
discriminant analysis, expectation maximization, and Bayes networks among others, are
based on the generative approach and have a probabilistic foundation in the Bayes
theorem.

Formally, given a hypothesis $h$ and a training dataset $D_{train}$, the Bayes theorem
helps in defining the probability of choosing the hypothesis given the data; that is,
it helps define $P(h \mid D_{train})$ given the prior probability of the hypothesis $P(h)$, the
likelihood of the data given the hypothesis $P(D_{train} \mid h)$, and the probability of data
over all hypotheses $P(D_{train}) = \int_h P(D_{train} \mid h)\,P(h)\,dh$ as:

$$P(h \mid D_{train}) = \frac{P(D_{train} \mid h)\,P(h)}{P(D_{train})} \qquad (2.64)$$


If there are multiple hypotheses, the question of which one is the most probable
one given the training data can be answered by the maximum a posteriori hypothesis
as:

$$h_{MAP} = \arg\max_{h \in \mathcal{H}} P(h \mid D_{train}) \qquad (2.65)$$

$$h_{MAP} = \arg\max_{h \in \mathcal{H}} \frac{P(D_{train} \mid h)\,P(h)}{P(D_{train})} \qquad (2.66)$$

Since $P(D_{train})$ is independent of h, we have:

$$h_{MAP} = \arg\max_{h \in \mathcal{H}} P(D_{train} \mid h)\,P(h) \qquad (2.67)$$

If we further assume that all the hypotheses are equally likely (i.e.,
$P(h_1) = P(h_2) = \cdots = P(h_m)$ for m hypotheses), the equation can be reduced to:

$$h_{ML} = \arg\max_{h \in \mathcal{H}} P(D_{train} \mid h) \qquad (2.68)$$

As stated in the assumptions, if the training examples are independent and
identically distributed (i.i.d.), $P(D_{train} \mid h)$ can be written in terms of the training
examples as:

$$P(D_{train} \mid h) = \prod_{i=1}^{N} P(x_i, y_i \mid h) = \prod_{i=1}^{N} P(y_i \mid x_i; h)\,P(x_i) \qquad (2.69)$$


2.5.5.1 Naive Bayes

The hypothesis in Bayes form for a binary classification $y_i \in \{0, 1\}$ is:

$$h_{Bayes}(x) = \arg\max_{y \in \{0,1\}} P(X = x \mid Y = y)\,P(Y = y) \qquad (2.70)$$

In naive Bayes, an assumption of independence between the features or attributes
is made. So, for d dimensions, the equation simplifies as:

$$h_{Bayes}(x) = \arg\max_{y \in \{0,1\}} P(Y = y) \prod_{j=1}^{d} P(X_j = x_j \mid Y = y) \qquad (2.71)$$

As a result, training and estimating parameters of naive Bayes is just measuring
two quantities, the priors for the class $P(Y = y)$ and the conditional for each
feature $P(X_j = x_j \mid Y = y)$ given the class. It can be easily shown that the maximum
likelihood estimates of these are nothing but counts in discrete datasets as shown
below:

$$P(Y = y) = \frac{1}{N} \sum_{i=1}^{N} [y_i = y] = \frac{\mathrm{countLabel}(y)}{N} \qquad (2.72)$$

$$P(X_j = x_j \mid Y = y) = \frac{\sum_{i=1}^{N} [y_i = y \wedge x_{i,j} = x_j]}{\sum_{i=1}^{N} [y_i = y]} \qquad (2.73)$$

Prediction for new examples can be done using the estimations and Eq. 2.70.
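The counting estimates can be sketched with the Python standard library as below. This is an illustration only: the function names and the tiny categorical dataset are hypothetical, and no smoothing is applied, so a feature value never seen with a class gives that class a probability of zero.

```python
from collections import Counter, defaultdict


def naive_bayes_train(X, y):
    """Estimate class priors and per-feature conditionals by counting."""
    n = len(y)
    label_counts = Counter(y)
    priors = {label: count / n for label, count in label_counts.items()}
    feature_counts = defaultdict(Counter)    # (label, feature index) -> value counts
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            feature_counts[(label, j)][value] += 1

    def conditional(j, value, label):
        """Estimate of P(X_j = value | Y = label)."""
        return feature_counts[(label, j)][value] / label_counts[label]

    return priors, conditional


def naive_bayes_predict(x, priors, conditional):
    """Return the label maximizing P(Y = y) * prod_j P(X_j = x_j | Y = y)."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for j, value in enumerate(x):
            score *= conditional(j, value, label)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical discrete dataset with two categorical features.
X_nb = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
y_nb = [1, 1, 0, 0]
nb_priors, nb_conditional = naive_bayes_train(X_nb, y_nb)
nb_label = naive_bayes_predict(("sunny", "mild"), nb_priors, nb_conditional)
```

In practice the product of many small probabilities is usually replaced by a sum of logarithms to avoid numerical underflow.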


2.5.5.2 Linear Discriminant Analysis 


Linear discriminant analysis (LDA) is another generative model, where the assumption
of a Gaussian distribution for $P(X \mid Y)$ is made along with equal priors for binary
classes, i.e., $P(Y = 1) = P(Y = 0) = 1/2$. Formally, $\mu_y \in \mathbb{R}^d$ is the multivariate mean
of class y, and $\Sigma$ is the covariance matrix. Then, we have:

$$P(X = x \mid Y = y)\,P(Y = y) = \frac{1}{2} \cdot \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^T \Sigma^{-1} (x - \mu_y) \right) \qquad (2.74)$$

The training of LDA, similar to naive Bayes, involves estimating the parameters
($\mu$ and $\Sigma$ here) from the training data.


2.5.6 Practical Tips for Linear Algorithms 


1. It is generally a good idea to scale the real-valued input features to the range
[0, 1] for gradient descent algorithms.

2. Binary or categorical features which are represented as one-hot vectors
can be used without any transformations. In one-hot vector representation,
each categorical attribute is converted into k boolean-valued attributes
such that only one of those k attributes has a value of one and the rest
are zero for a given instance.

3. Grid search over a range of values spanning multiple orders of magnitude
should be used to determine the learning rate and the regularization
parameter.


2.6 Non-linear Algorithms 


The algorithms we have seen so far, given by $\mathrm{sign}(w^T x)$, are linear in the weights w,
as the inputs x are a given or constant for the training algorithm. A simple extension
is to use a non-linear transform $\phi(x)$ applied to all the features, which transforms
the points into a new space, say Z, where a linear model given by $\mathrm{sign}(w^T \phi(x))$
can then be learned. When a prediction is needed on a new unseen data point x, first the
data is transformed into the Z space using the transformation $\phi(x)$, and then the linear
algorithm weights are applied in the Z space to make the prediction.


As an example, a simple non-linear, two-dimensional training dataset can
be transformed into a three-dimensional Z space, where the dimensions are
$z = (x_1, x_2, x_1^2 + x_2^2)$. The Z space is linearly separable, as shown in Fig. 2.11.

Fig. 2.11: Illustration of non-linear to linear transformation and finding a separating
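The transformation in the example above can be sketched in a few lines. The function name `to_z_space`, the toy points (an inner cluster and an outer ring), and the hand-picked separating plane are assumptions used purely for illustration.

```python
import numpy as np


def to_z_space(X):
    """Map (x1, x2) to z = (x1, x2, x1^2 + x2^2)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])


# Hypothetical data: class -1 lies near the origin, class +1 on a ring of radius ~2.
inner = [[0.5, 0.0], [0.0, -0.5], [-0.3, 0.3]]
outer = [[2.0, 0.0], [0.0, -2.0], [1.5, 1.5]]
Z = to_z_space(inner + outer)

# A linear model in the Z space: only the third coordinate is needed here.
w_z = np.array([0.0, 0.0, 1.0])
b_z = -2.0
z_labels = np.sign(Z @ w_z + b_z)
```

No straight line separates the two classes in the original two-dimensional space, but a plane at a fixed value of the third coordinate does so in the Z space.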
2.6.1 Support Vector Machines


Support vector machine (SVM) is one of the most popular non-linear machine learning
algorithms that can separate both linear and non-linear data using built-in
transformations known as kernels [Vap95]. SVMs not only separate the data but also find
the hyperplane that separates the data optimally through a principle
known as maximum margin separation, as shown in Fig. 2.12. The data points that
lie on the margin and thereby determine the separating hyperplane are known as
support vectors.


[Figure: training data of Class -1 and Class 1 separated by a hyperplane, with the margin marked.]

Fig. 2.12: Illustration of SVM finding the maximum margin separation between
labeled data


In SVM the hyperplane is obtained by the kernel transformation k(x,x’), which 
takes any two data points x and x’ and obtains the transformed real value in the inner 
product space: 


y=b+> ojk(x,x ) (2.75) 


The idea of kernels is to implicitly perform the non-linear transformation onto
the $Z$ space without any explicit transformation $\phi(x)$ through a concept known as
the kernel trick. The radial basis function, also known as the Gaussian kernel, is
one such kernel transformation:

$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\sigma}\right) \tag{2.76}$$





Gaussian kernels can be shown to map the input space into an infinite-dimensional
feature space. The transformation shown in Fig. 2.11 can be generalized to a
polynomial kernel of degree $\sigma$ given by:

$$k(x, x') = (1 + x \cdot x')^{\sigma} \tag{2.77}$$
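The kernel trick can be checked numerically on a small case. The sketch below assumes two-dimensional inputs and a degree-2 polynomial kernel; the explicit feature map `phi_degree2` is the standard expansion of $(1 + x \cdot x')^2$ and is introduced here only for illustration. The Gaussian kernel divides by a width parameter called `sigma`, following Eq. 2.76 (some texts use $2\sigma^2$ instead).

```python
import math

def dot(x, z):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / sigma)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma)

def poly_kernel(x, z, degree=2):
    """Polynomial kernel (1 + x . z)^degree."""
    return (1.0 + dot(x, z)) ** degree

def phi_degree2(x):
    """Explicit 6-D feature map matching the degree-2 kernel for 2-D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 ** 2, r2 * x1 * x2, x2 ** 2)

x, z = (1.0, 2.0), (0.5, -1.0)
implicit = poly_kernel(x, z)                     # computed in the 2-D input space
explicit = dot(phi_degree2(x), phi_degree2(z))   # computed in the 6-D feature space
```

Both values agree, which is the essence of the trick: the kernel returns the inner product in the transformed space without ever constructing the transformed vectors.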
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2.6.2 Other Non-linear Algorithms 


The k-nearest neighbors algorithm is another simple non-linear algorithm. It is also
known as the lazy learner, as its core idea is to hold all the training data in memory
and use a distance metric and the user-specified $k$ (number of neighbors) to classify
an unseen new data point. Generally, a distance metric such as Euclidean or
Manhattan, generalized as the Minkowski distance, is used to compute the distance
between points:

$$\mathrm{dist}(x, x') = \left(\sum_i |x_i - x'_i|^q\right)^{1/q} \tag{2.78}$$
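The following sketch shows one way to put the distance and the voting rule together. The function names, the exponent name `q`, and the six toy points are assumptions made for illustration; ties in the vote are resolved arbitrarily here.

```python
from collections import Counter

def minkowski(x, z, q=2):
    """Minkowski distance; q=1 gives Manhattan, q=2 gives Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, z)) ** (1.0 / q)

def knn_predict(train, query, k=3, q=2):
    """Majority vote among the k training points closest to the query.

    train is a list of (point, label) pairs that is simply kept in memory,
    which is why k-nearest neighbors is called a lazy learner.
    """
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, q))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1),
         ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1)]
```

Sorting the whole training set is the simplest choice but costs time proportional to the dataset size per query; libraries typically use tree or hashing structures to speed this up.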


Neural networks are another extension of perceptrons that are used to create
non-linear boundaries. We discuss them in detail in Chap. 4. Decision trees and many of
their extensions, such as gradient boosting and random forests, among
others, are based on the principle of finding simpler decision boundaries for the
features and combining them in hierarchical trees, as shown in Fig. 2.13.


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Fig. 2.13: A two-dimensional example and classifier boundaries to separate two 
classes using decision trees 


2.7 Feature Transformation, Selection, and Dimensionality 
Reduction 


In this section we will review some of the common techniques used for feature 
transformation, selection, and reduction. 
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2.7.1 Feature Transformation 


In many algorithms, it is beneficial to have all the features in the same range, for
example, in [0, 1], so that the algorithm is not biased toward features with larger
values and runs effectively across all features. Some commonly used transformations
are as follows.


2.7.1.1 Centering or Zero Mean 


Each feature can be transformed by subtracting the mean from its value,
$f_{feature,i} = f_{feature,i} - \bar{f}_{feature}$, where

$$\bar{f}_{feature} = \frac{1}{N}\sum_{i=1}^{N} f_{feature,i} \tag{2.79}$$


2.7.1.2 Unit Range 


Each feature can be transformed to be in the range [0, 1]. For a feature $f_{feature}$,
let $f_{featureMax}$ correspond to the maximum value in the dataset, and $f_{featureMin}$
correspond to the minimum value. Then the transformation for an instance $i$ is:

$$f_{feature,i} = \frac{f_{feature,i} - f_{featureMin}}{f_{featureMax} - f_{featureMin}} \tag{2.80}$$


2.7.1.3 Standardization 


In this transformation, the features are changed to have zero mean and unit variance. The
empirical variance $v_{feature}$ of a feature is calculated on a dataset by:

$$v_{feature} = \frac{1}{N}\sum_{i=1}^{N} \left(f_{feature,i} - \bar{f}_{feature}\right)^2 \tag{2.81}$$

The transformation for each feature is $f_{feature,i} = \frac{f_{feature,i} - \bar{f}_{feature}}{\sqrt{v_{feature}}}$.


2.7.1.4 Discretization 


Continuous features are sometimes transformed to categorical types by defining the 
number of categories or the category width. 
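The four transformations in this section can be sketched in a few lines of plain Python. The function names are illustrative, and equal-width binning is only one of several possible discretization schemes; it is assumed here because the text mentions defining the number of categories.

```python
import math

def center(values):
    """Subtract the mean so the feature has zero mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def unit_range(values):
    """Rescale to [0, 1]; assumes the feature is not constant."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Zero mean and unit variance; assumes the variance is positive."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var) for v in values]

def discretize(values, n_bins):
    """Equal-width binning into integer categories 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

feature = [2.0, 4.0, 6.0, 8.0]
```

The statistics (mean, minimum, maximum, variance) should be computed on the training data only and then reused to transform the test data, so that the test set does not influence the preprocessing.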
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2.7.2 Feature Selection and Reduction 


We have already seen that regularization with $L_1$ or $L_2$ can be considered as a feature
scoring and selection mechanism and can be employed directly in an algorithm to
reduce or prioritize the impact of features. There are many univariate and multivariate
feature selection approaches that use information-theoretic, statistical, sparse
learning-based, and wrapper-based algorithms for finding features [GE03, CS14]. There are
also various dimensionality or feature reduction techniques that transform and reduce the
feature set to a smaller subset of more meaningful features. In this section, we will
highlight one such statistical method known as principal component analysis
(PCA), which is also applicable to deep learning techniques.


2.7.2.1 Principal Component Analysis 


PCA is a linear dimensionality reduction technique which, given an input matrix
$X$, tries to find a feature matrix $W$ such that the size $m$ of the feature matrix $W$ is
much lower than the input dimension $d$ ($m \ll d$) and each reduced feature in this
new feature matrix captures maximum variance from the inputs [Jol86]. This can be
considered as the process of finding the matrix $W$ such that the weights decorrelate
or minimize relationships between features (Fig. 2.14).

This can be expressed as below:

$$(WX)^T (WX) = Z^T Z = NI \tag{2.82}$$

[Figure 2.14 labels: Weight Matrix (Z), samples (n) by dimensions (m); Feature Matrix (W); Input Matrix (X), samples (n)]

Fig. 2.14: PCA process of finding reduced dimensions from original features

$$W X^T X W^T = NI \tag{2.83}$$

Solving for the diagonalization, the above equation becomes:

$$W\,\mathrm{Cov}(X)\,W^T = I \tag{2.84}$$


Covariance matrices are positive semi-definite and symmetric, and they have orthogonal
eigenvectors and real-valued eigenvalues. A matrix $A$ can be factorized as $U \Lambda U^T = A$,
where $U$ has the eigenvectors of $A$ in its columns, and $\Lambda = \mathrm{diag}(\lambda_i)$, where $\lambda_i$ are
the eigenvalues of $A$.

Thus the solution for $W\,\mathrm{Cov}(X)\,W^T$ is a function of the eigenvectors $U$ and
eigenvalues $\Lambda$ of the covariance matrix, i.e., $\mathrm{Cov}(X)$. The algorithm is shown below.


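Algorithm 9: PCA

Data: Dataset $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^d$, Components $= m$
Result: Transformed Data $Y \in \mathbb{R}^m$
begin
    $X \leftarrow [x_1 - \mu, x_2 - \mu, \ldots, x_N - \mu]$
    $S_t \leftarrow \frac{1}{N} X X^T$
    $X^T X = V \Lambda V^T$
    $U \leftarrow X V \Lambda^{-1/2}$
    $U_m \leftarrow [u_1, u_2, \ldots, u_m]$
    $Y \leftarrow U_m^T X$
    return Y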
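The steps of Algorithm 9 can be sketched with NumPy as follows. This is an illustrative reading of the algorithm, assuming that samples are stored as columns of a $d \times N$ matrix and that the top $m$ eigenvalues are strictly positive; the function name `pca` and the random test data are not from the text.

```python
import numpy as np

def pca(X, m):
    """Project the columns of X (d x N, one sample per column) onto m components."""
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                   # center every sample
    # Eigen-decomposition of the N x N matrix X^T X = V Lambda V^T
    eigvals, V = np.linalg.eigh(Xc.T @ Xc)
    order = np.argsort(eigvals)[::-1][:m]         # indices of the m largest eigenvalues
    eigvals, V = eigvals[order], V[:, order]
    U_m = Xc @ V / np.sqrt(eigvals)               # U = X V Lambda^{-1/2}, d x m
    return U_m.T @ Xc                             # Y = U_m^T X, m x N

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))                      # 50 synthetic samples with 5 features
Y = pca(X, 2)
```

The rows of `Y` are uncorrelated and ordered by decreasing variance, which matches the decorrelation condition expressed by the equations above.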


2.8 Sequence Data and Modeling 


In many sequence data analysis problems, such as language modeling, time series
analysis, and signal processing, modeling the underlying process as a Markov
process has been very successful. Many traditional NLP tasks, such as part-of-speech
tagging, information extraction, and phrase chunking, have been modeled very
successfully using hidden Markov models (HMM), a special type of Markov
process. In the next few sections, we will discuss some important theory, properties, and
algorithms associated with Markov chains [KS+60].


2.8.1 Discrete Time Markov Chains 


Markov chains are the basic building blocks for modeling many sequential
processes. Consider a finite set of states modeled as $S = \{s_1, s_2, \ldots, s_n\}$, and let a
variable $q$ represent the transition at any time $t$ as $q_t$. An illustration is provided in
Fig. 2.15.


The Markov property states that at any time $t$, the probability of being in a state
$s_i$ depends only on the previous $k$ states rather than on all the states from time 1 to $t - 1$.
This can be expressed as:

$$P(q_t = s_i \mid q_{t-1}, q_{t-2}, \ldots, q_1) = P(q_t = s_i \mid q_{t-1}, q_{t-2}, \ldots, q_{t-k}) \tag{2.85}$$
[Figure 2.15 labels: states $q_{t-1}$, $q_t$, $q_{t+1}$; Transition]


Fig. 2.15: Markov chain transitions over three time steps 


The simplest Markov chain depends only on the most recent state (k = 1) and is 
represented as: 


$$P(q_t = s_i \mid q_{t-1}, q_{t-2}, \ldots, q_1) = P(q_t = s_i \mid q_{t-1}) \tag{2.86}$$


Such a Markov chain for a fixed set of states given by $S = \{s_1, s_2, \ldots, s_n\}$ can be
represented using an $n \times n$ transition matrix $A$, where each element
captures the transition probability as:

$$A_{i,j} = P(q_t = s_i \mid q_{t-1} = s_j) \tag{2.87}$$

and an $n$-dimensional vector $\pi$ which contains the initial state probabilities:

$$\pi_i = P(q_1 = s_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} \pi_i = 1 \tag{2.88}$$
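As a small illustration, the probability of a whole state sequence under a first-order chain is the initial probability of the first state times the product of the transition probabilities. The weather states and all numbers below are made up, and the sketch indexes the transition table as `A[previous][next]`, which is an indexing choice of this example.

```python
def sequence_probability(states, pi, A):
    """P(q_1, ..., q_T) for a first-order Markov chain.

    pi[s] is the initial probability of state s, and A[prev][nxt] is the
    probability of moving from prev to nxt (rows index the previous state).
    """
    prob = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= A[prev][nxt]
    return prob

# Hypothetical two-state chain
pi = {"sunny": 0.6, "rainy": 0.4}
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.5, "rainy": 0.5}}

p = sequence_probability(["sunny", "sunny", "rainy"], pi, A)
```

Here `p` is the product 0.6 × 0.8 × 0.2. For long sequences the product underflows quickly, so implementations usually sum log-probabilities instead.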


2.8.2 Generative Approach: Hidden Markov Models


At times, the states of the Markov chain are hidden and not observed, but they
produce effects which are observable. Such Markov chains are represented using
hidden Markov models (HMM), as shown in Fig. 2.16, where new observed states,
represented by a set $V$ with a fixed number $m$ of elements, $V = \{v_1, v_2, \ldots, v_m\}$, are added
to the previous Markov chain [Rab89].

The concepts required in HMM are: 


• Finite hidden states $S = \{s_1, s_2, \ldots, s_n\}$ and finite observable states $V = \{v_1, v_2, \ldots, v_m\}$.

• For a fixed state sequence of length $T$, given by $Q = q_1, q_2, \ldots, q_T$, the
observations are given by $O = o_1, o_2, \ldots, o_T$.

• The parameters of the HMM are $\lambda = (A, b, \pi)$, where

  - The transition matrix $A$ represents the transition probability from state $s_i$ to $s_j$
    and is given by:

$$A_{ij} = P(q_t = s_j \mid q_{t-1} = s_i) \tag{2.89}$$
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Fig. 2.16: Hidden Markov models (HMM) 


  - The vector $b$ represents the probability of observing a state $v_k$, given the hidden
    state $s_j$, and is independent of time, given by:

$$b_j(k) = P(o_t = v_k \mid q_t = s_j) \tag{2.90}$$

  - The vector $\pi$ represents the initial probability of the states and is given by:

$$\pi_i = P(q_1 = s_i) \quad \text{s.t.} \quad \sum_{i=1}^{n} \pi_i = 1 \tag{2.91}$$

• The first-order HMMs have two independence assumptions:

$$P(q_t = s_i \mid q_{t-1}, q_{t-2}, \ldots, q_1) = P(q_t = s_i \mid q_{t-1}) \tag{2.92}$$

$$P(o_t = v_j \mid o_{t-1}, o_{t-2}, \ldots, o_1, q_t, q_{t-1}, \ldots, q_1) = P(o_t = v_j \mid q_t) \tag{2.93}$$


HMMs can be used to answer various fundamental questions through different 
dynamic programming-based algorithms, of which we list a few below. 


1. Likelihood
Given an HMM ($\lambda$) and a sequence of observations $O$, what is the likelihood
of the HMM generating the sequence; i.e., what is $P(O \mid \lambda)$? A dynamic
programming-based technique known as the forward algorithm, which stores
intermediate values of the states and their probabilities to build up the
probability of the whole sequence in an efficient manner, is generally employed.

2. Decoding
Given an HMM ($\lambda$) and a sequence of observations $O$, what is the most
likely hidden state sequence $S$ that generated the observations? A dynamic
programming-based technique known as the Viterbi algorithm, similar to the
forward algorithm with minor changes, is used to answer this question.

3. Learning: Supervised and Unsupervised
Given a sequence of observations $O$ and the state sequences $S$, what are the
parameters of the HMM that could generate it? This is a supervised learning
problem, and the parameters can be easily obtained from the training examples by computing
the different probabilities using the Count() function for the likelihood estimates.
The individual cells $A_{ij}$ of the transition probability matrix $A$ can be estimated
by counting the number of times the state $s_i$ is followed by the state $s_j$ as:

$$A_{ij} = P(s_j \mid s_i) = \frac{\mathrm{Count}(s_i, s_j)}{\mathrm{Count}(s_i)} \tag{2.94}$$

The elements of the array $b_j(k)$ can be estimated by counting the number of
times the observed state $v_k$ occurs together with the hidden state $s_j$ and is given
by:

$$b_j(k) = P(v_k \mid s_j) = \frac{\mathrm{Count}(v_k, s_j)}{\mathrm{Count}(s_j)} \tag{2.95}$$

And the initial probabilities are computed as:

$$\pi_i = P(q_1 = s_i) = \frac{\mathrm{Count}(q_1 = s_i)}{\mathrm{Count}(q_1)} \tag{2.96}$$
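The counting estimates for the supervised case can be sketched as below. The three tiny tagged sentences and their tag names are invented for illustration. One detail is an assumption of this sketch: transition counts are normalized by the number of times a state occurs as the source of a transition, so that each row of the estimated transition table sums to one.

```python
from collections import Counter

def estimate_hmm(tagged_sequences):
    """Estimate (A, b, pi) by counting, given sequences of (observation, state) pairs."""
    trans, emit, start = Counter(), Counter(), Counter()
    from_count, state_count = Counter(), Counter()
    for seq in tagged_sequences:
        states = [s for _, s in seq]
        start[states[0]] += 1                      # Count(q_1 = s_i)
        for obs, s in seq:
            emit[(s, obs)] += 1                    # Count(v_k, s_j)
            state_count[s] += 1                    # Count(s_j)
        for prev, nxt in zip(states, states[1:]):
            trans[(prev, nxt)] += 1                # Count(s_i, s_j)
            from_count[prev] += 1                  # s_i seen as a transition source
    A = {k: c / from_count[k[0]] for k, c in trans.items()}
    b = {k: c / state_count[k[0]] for k, c in emit.items()}
    pi = {s: c / len(tagged_sequences) for s, c in start.items()}
    return A, b, pi

# Hypothetical training data: (word, part-of-speech tag) pairs
data = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
        [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
        [("dogs", "NOUN"), ("bark", "VERB")]]
A, b, pi = estimate_hmm(data)
```

Pairs that never occur in the training data are simply absent from the dictionaries, which corresponds to a probability of zero; smoothing is commonly added in practice to avoid this.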


If only the sequence of observations $O$ is provided and we need to learn the model
that maximizes the probability of the sequence, the unsupervised learning problem
is solved using a variation of the expectation maximization (EM) algorithm known
as the Baum-Welch algorithm.
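The likelihood and decoding questions can be sketched with two short dynamic programs. The health example, its state and observation names, and all probabilities are illustrative assumptions; the tables are indexed as `A[previous][next]` and `b[state][observation]`.

```python
def forward(obs, states, pi, A, b):
    """Likelihood P(O | model) computed with the forward algorithm."""
    alpha = {s: pi[s] * b[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: b[s][o] * sum(alpha[r] * A[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, pi, A, b):
    """Most likely hidden state sequence and its joint probability."""
    delta = {s: pi[s] * b[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        prev_delta, delta, pointer = delta, {}, {}
        for s in states:
            # The forward sum over predecessors is replaced by a max
            best = max(states, key=lambda r: prev_delta[r] * A[r][s])
            pointer[s] = best
            delta[s] = prev_delta[best] * A[best][s] * b[s][o]
        backpointers.append(pointer)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical model
states = ["Healthy", "Fever"]
pi = {"Healthy": 0.6, "Fever": 0.4}
A = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
     "Fever": {"Healthy": 0.4, "Fever": 0.6}}
b = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
     "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
obs = ["normal", "cold", "dizzy"]

likelihood = forward(obs, states, pi, A, b)
best_path, best_prob = viterbi(obs, states, pi, A, b)
```

The only structural difference between the two functions is the replacement of the sum by a max plus the bookkeeping of back-pointers, which is the "minor change" that turns likelihood computation into decoding.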


2.8.3 Discriminative Approach: Conditional Random Fields


Analogous to the relation between naive Bayes and logistic regression, conditional
random fields (CRFs) have a similar relationship to HMMs in the sequence modeling
world. HMMs have shortcomings in effectively modeling the dependencies between
the inputs or the observed states, and even the overlapping relationships between
them. A linear chain CRF can be considered the undirected graphical model
equivalent of a linear HMM, as shown in Fig. 2.17 [LMP01]. CRFs are used mostly in
supervised learning problems, though there are extensions that address unsupervised
learning, whereas HMMs can be easily used for unsupervised learning.

To illustrate CRFs, let us take as input a simple sentence: "Obama gave a speech
at the Google campus in Mountain View." Each input word will be assigned a tag
of either Person, Organization, Location, or Other, as illustrated in Fig. 2.18. The
association of these tags to words is known as the named entity recognition problem
in text processing.
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Fig. 2.17: Relationship between generative and discriminative models in non- 
sequence and sequence-based data 


2.8.3.1 Feature Functions 


Feature functions are the basic units of a CRF. They capture the relationship between
a pair of consecutive outputs $y_{i-1}, y_i$ and the entire input sequence $x_1, x_2, \ldots, x_n$
as a real-valued output given by $f_j(y_{i-1}, y_i, x_{1:n}, i)$. A simple binary feature function
can be written for our example as:

$$f_1(y_{i-1}, y_i, x_{1:n}, i) = \begin{cases} 1 & \text{if } y_i = \text{Location},\ y_{i-1} = \text{Location and } x_i = \text{View} \\ 0 & \text{otherwise} \end{cases} \tag{2.97}$$


2.8.3.2 CRF Distribution 


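The entire labeled sequence of length $n$ can be modeled as log-linear in terms of the
feature functions $f_j(y_{i-1}, y_i, x_{1:n}, i)$ and their weights $\lambda_j$, similar to logistic
regression, as given by:

$$P(y \mid x, \lambda) = \frac{1}{Z(x, \lambda)} \exp\left(\sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x_{1:n}, i)\right) \tag{2.98}$$

where $Z(x, \lambda)$ is known as the normalization constant or the partition function and
is given by:

$$Z(x, \lambda) = \sum_{y \in Y} \exp\left(\sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x_{1:n}, i)\right) \tag{2.99}$$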
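To make the distribution concrete, the sketch below scores tag sequences for a three-word fragment of the example sentence and normalizes by brute-force enumeration of all tag sequences. The second feature `f2`, the weights, and the `"START"` symbol used for the position before the first word are hypothetical additions for illustration; only `f1` comes from the text.

```python
import math
from itertools import product

TAGS = ["Person", "Organization", "Location", "Other"]

def f1(y_prev, y_cur, x, i):
    """Binary feature from the text: 'View' tagged Location right after a Location."""
    return 1.0 if (y_cur == "Location" and y_prev == "Location"
                   and x[i] == "View") else 0.0

def f2(y_prev, y_cur, x, i):
    """Hypothetical feature: a capitalized word receives a tag other than Other."""
    return 1.0 if (x[i][0].isupper() and y_cur != "Other") else 0.0

FEATURES = [f1, f2]

def score(y, x, weights):
    """Sum over positions i and features j of lambda_j * f_j(y_{i-1}, y_i, x, i)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else "START"   # assumed start symbol
        total += sum(w * f(y_prev, y[i], x, i) for w, f in zip(weights, FEATURES))
    return total

def crf_probability(y, x, weights):
    """P(y | x, lambda), with the partition function computed by enumeration."""
    z = sum(math.exp(score(list(cand), x, weights))
            for cand in product(TAGS, repeat=len(x)))
    return math.exp(score(y, x, weights)) / z

x = ["in", "Mountain", "View"]
weights = [2.0, 1.0]   # illustrative values, not learned
p_good = crf_probability(["Other", "Location", "Location"], x, weights)
p_bad = crf_probability(["Other", "Other", "Other"], x, weights)
```

Enumeration grows exponentially with the sentence length, so it is only viable for toy inputs; linear chain CRFs are normally normalized with a forward-style dynamic program.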
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Fig. 2.18: An information extraction example given with words as the inputs and the 
named entity tags as the outputs of the CRF 


2.8.3.3 CRF Training 


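Similar to logistic regression, maximum likelihood (minimizing the negative log-likelihood) can be
used for learning the parameters ($\lambda$) of the CRF. Considering $m$ training sequences
$D = \{(x^1, y^1), (x^2, y^2), \ldots, (x^m, y^m)\}$, the total negative log-likelihood loss can be written as:

$$L(\lambda, D) = -\log\left(\prod_{k=1}^{m} P(y^k \mid x^k, \lambda)\right) \tag{2.100}$$

$$L(\lambda, D) = -\sum_{k=1}^{m} \log\left[\frac{1}{Z(x^k, \lambda)} \exp\left(\sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y^k_{i-1}, y^k_i, x^k_{1:n}, i)\right)\right] \tag{2.101}$$

The optimal parameter $\lambda_{opt}$ can be estimated using the equation below, where $C$
acts as a prior or regularization constant.

$$\lambda_{opt} = \arg\min_{\lambda}\; L(\lambda, D) + C\,\frac{1}{2}\|\lambda\|^2 \tag{2.102}$$

The above objective is convex, so solving for the optimum is guaranteed to yield
the global optimum. If we rewrite the feature functions for simplicity as
below and differentiate the loss w.r.t. $\lambda_j$:

$$F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x_{1:n}, i) \tag{2.103}$$

$$\frac{\partial L(\lambda, D)}{\partial \lambda_j} = -\underbrace{\frac{1}{m}\sum_{k=1}^{m} F_j(y^k, x^k)}_{\text{observed mean feature value}} + \underbrace{\sum_{k=1}^{m} E_{P(y \mid x^k, \lambda)}\left[F_j(y, x^k)\right]}_{\text{expected feature value given the model}} \tag{2.104}$$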


It can be seen that setting this gradient to zero does not yield a closed-form solution,
so the optimum cannot be found analytically. Iterative algorithms such as L-BFGS or
gradient descent (as discussed above) are generally employed to obtain the solution.
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2.9 Case Study 


We now take the reader through a real-world application of the concepts introduced 
in the chapter through a case study. The case study also equips the reader with 
necessary practical hands-on tools, libraries, methods, code, and analysis that will 
be useful for standard machine learning, as well as deep learning. 

We use the Higgs Boson challenge, which was hosted by Kaggle. The challenge
data is now available as the ATLAS Higgs Challenge 2014 dataset. The task is to classify
the events into signal and background (any event other than the signal), which
is a binary classification problem. Most Kaggle challenges or hackathons provide
training data with labels. The submitted models are then evaluated on
blind test data using a well-known metric. Instead of the entire dataset, we use
a sample with 10,000 training instances and a separate labeled test set
of 5000 instances on which the models will be evaluated. We also assume that the
best model is selected based on the classification accuracy achieved on the test data;
accuracy is a suitable metric because the data is well balanced between the two classes.

The goal of the case study is to use various techniques and methods illustrated
in the chapter and compare their performance on the unseen test set. Various Python
libraries such as NumPy, SciPy, pandas, and scikit-learn, which are used extensively
in machine learning, are introduced in the case study.


2.9.1 Software Tools and Libraries 


First, we need to describe the main open source tools and libraries we will use for 
our case study. 


• Pandas (https://pandas.pydata.org/) is a popular open source library for
data structures and data analysis. We will use it for data exploration and some
basic processing.

• scikit-learn (http://scikit-learn.org/) is a popular open source library for various
machine learning algorithms and evaluations. In our case study we will use it for
sampling and creating datasets and for its implementations of linear and non-linear
machine learning algorithms.

• Matplotlib (https://matplotlib.org/) is a popular open source library for visualization.
We will use it to visualize performance.


2.9.2 Exploratory Data Analysis (EDA) 


We use basic EDA to understand the characteristics of the data through univariate 
statistics, correlation analysis, and visualization. 
One of the most important principles we highlighted at the beginning of
this book is to avoid data snooping, i.e., letting the test set labels influence
the model or process decisions. Performing distribution analysis, statistical
analysis on features, and confirming that the training and test datasets look
similar in the splits are all considered to be valid exploratory analysis steps.
An example of exploratory data analysis follows.


1. Exploring the number of training and testing data in terms of features and num- 
ber of classes per set. 

2. Exploring the data types for each feature to determine whether it is categorical,
continuous, ordinal, etc. and transforming them if needed based on the domain.

3. Finding if the features have missing or unknown values and transforming them 
as needed. 

4. Understanding the distribution of each feature using scatter plots, histogram 
plots, box plots, etc., to see basic statistics of range, variance, and distribution 
for the features. This is illustrated in Fig. 2.19. 

5. Understanding similarity and differences between each of these statistics and 
plots for the training and testing data features. 

6. Calculating pairwise correlation between the features and correlation between 
features and the labels on training set. Plotting these and visualizing them (as in 
Fig. 2.19) gives a great aid to subject matter experts and data scientists. 
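The steps above can be sketched with pandas as follows. Because the actual challenge files are not reproduced here, the sketch builds small synthetic training and test frames; the column names `f1`, `f2`, `f3`, and `label` are placeholders, not the real feature names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training and test sets (column names are made up)
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
train["label"] = (train["f1"] + rng.normal(scale=0.5, size=100) > 0).astype(int)
test = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])

shape_summary = {"train": train.shape, "test": test.shape}   # step 1: sizes
class_counts = train["label"].value_counts()                  # step 1: classes per set
dtypes = train.dtypes                                         # step 2: data types
missing = train.isna().sum()                                  # step 3: missing values
train_stats = train.describe()                                # step 4: range and spread
test_stats = test.describe()                                  # step 5: compare with train
correlations = train.corr(method="pearson")                   # step 6: pairwise correlation
```

Note that the correlation with the label is computed on the training frame only, in line with the data snooping principle discussed above. Histograms and box plots for step 4 can be drawn from the same frames with `train.hist()` or `train.boxplot()`.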


2.9.3 Model Training and Hyperparameter Search 


In this section we will go over some standard machine learning techniques per- 
formed on the data for learning effective models. 


2.9.3.1 Feature Transformation and Reduction Impact 


Understanding the impact of feature transformation and selection is one of the
preliminaries for training a machine learning model. As discussed earlier in the chapter,
there are various dimensionality reduction techniques, such as PCA, SVD, and others,
and various selection techniques, such as mutual information, chi-square, and others,
and each of them has parameters that need to be tuned. These will impact the model
and training algorithms as well. The feature selection and dimensionality reduction
techniques should be treated as hyperparameters that the model selection process
will optimize. For the sake of brevity, in this section we will analyze only two
techniques: PCA and chi-square feature selection.
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Fig. 2.19: Exploratory data analysis plots. (a) Histogram plot for each feature and 
label on training data. (b) Pearson correlation across features and labels on training 
data 
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We first perform PCA with two components and plot these components with 
labels to see if the newly reduced training dataset with two components shows im- 
proved separation. We then increase the dimensions and plot the cumulative ex- 
plained variances by adding the variance captured by each transformed feature. Fig- 
ure 2.20 shows the two plots and reveals that the PCA transformation and reduction 
may not be useful for this dataset; the transformed features need as many dimensions 
as the original features to capture the variances. 
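A sketch of this analysis with scikit-learn is shown below. It runs on synthetic Gaussian data as a stand-in for the training features, so the numbers it produces say nothing about the Higgs data; the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))      # synthetic stand-in for the training features

# Two components, e.g., for a scatter plot colored by label
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_train)

# All components, to inspect how quickly the explained variance accumulates
pca_full = PCA().fit(X_train)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
```

If `cumulative` rises slowly and only approaches 1.0 near the full number of dimensions, as the text reports for this dataset, reducing the dimensionality with PCA discards a substantial share of the variance.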
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Fig. 2.20: PCA analysis. (a) PCA of two transformed components as scatter plot on 
training data. (b) Cumulative explained variance for first 25 dimensions 


We also perform chi-square analysis on the data by first scaling the features to the
range of [0, 1], as chi-square needs all features to be non-negative. The plots of
scores and feature names are shown in Fig. 2.21. Only 16 features have scores above
a threshold of 0.1, and that may be a good subset to choose if reduction is pursued.
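The scaling and scoring steps can be sketched as follows, again on synthetic data: one column is constructed to depend on the label and two are pure noise, so the informative column should receive the highest score. The data construction is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
informative = y + rng.normal(scale=0.3, size=300)   # depends on the label
noise = rng.normal(size=(300, 2))                    # unrelated to the label
X = np.column_stack([informative, noise])

X_scaled = MinMaxScaler().fit_transform(X)   # chi2 requires non-negative features
scores, p_values = chi2(X_scaled, y)
ranking = np.argsort(scores)[::-1]           # feature indices, best first
```

A threshold on `scores`, or `SelectKBest(chi2, k=...)`, can then be used to keep only the highest-scoring features.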


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Fig. 2.21: Chi-square scores on the features plotted in the descending order 
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2.9.3.2 Hyperparameter Search and Validation 


We choose accuracy as the metric for performing search of hyperparameters for the 
algorithms and for comparing algorithms, because that is the metric that the test 
data prediction will be evaluated on. We will use cross-validation as our validation 
technique in the hyperparameter search. We will use five linear and non-linear al- 
gorithms for training: (a) perceptron, (b) logistic regression, (c) linear discriminant 
analysis, (d) naive Bayes, and (e) support vector machines (with RBF kernel). 

The code below highlights the hyperparameter search for SVM. 


from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy

# gamma parameter in SVM
gammas = numpy.array([1, 0.1, 0.01, 0.001])
# C parameter for SVM
c_values = numpy.array([100, 1, 0.1, 0.01])
# grid search for gamma and C
svm_param_grid = {'gamma': gammas, 'C': c_values}
# svm with rbf kernel
svm = SVC(kernel='rbf')
scoring = 'accuracy'
# grid search
grid = GridSearchCV(estimator=svm, param_grid=svm_param_grid,
                    scoring=scoring)


The hyperparameters found and the validation results are given in Table 2.3. It is
interesting to observe that the simplest linear perceptron has the lowest score and that,
as the complexity of the model is increased up to the fully non-linear RBF kernel SVM,
the performance improves.


Table 2.3: Hyperparameters and validation scores for the classifiers

Classifier          | Parameter and values                    | Tenfold cross-validation AUC
Perceptron          | α = 0.001, maxIter = 100                | 0.54
Logistic regression | penalty = L1, C = 0.1, maxIter = 100    | 0.61
LDA                 | tolerance = 0.001                       | 0.60
Naive Bayes         | -                                       | 0.60
SVM (RBF)           | γ = 0.01, C = 100                       | 0.63


We next see if there is an impact of the feature selection techniques on a classifier
by performing grid search over all the parameters of feature selection/reduction and
classification. We use PCA and k-best selection with chi-square as the two feature selection
techniques, and logistic regression as the classifier. By plotting the classification
accuracy for various combinations, as shown in Fig. 2.22, we see that there is no
impact of the feature selection and reduction on the validation performance.
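One way to search jointly over the reduction step and the classifier is a scikit-learn pipeline with an interchangeable step. The sketch below is an assumption about how such a search could be set up, not the exact code behind Fig. 2.22; the data is synthetic and the grid values are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                      # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # synthetic labels

pipeline = Pipeline([
    ("scale", MinMaxScaler()),                     # keeps features non-negative for chi2
    ("reduce", "passthrough"),                     # placeholder filled in by the grid
    ("clf", LogisticRegression(max_iter=200)),
])
param_grid = [
    {"reduce": [PCA()], "reduce__n_components": [2, 4], "clf__C": [0.1, 1.0]},
    {"reduce": [SelectKBest(chi2)], "reduce__k": [2, 4], "clf__C": [0.1, 1.0]},
]
grid = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
grid.fit(X, y)
```

After fitting, `grid.cv_results_` holds the mean validation accuracy for every combination, which is the information needed for a plot like Fig. 2.22, and `grid.best_params_` reports the winning configuration.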
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Fig. 2.22: Validation accuracy of logistic regression with different feature selection 
techniques 


2.9.3.3 Learning Curves 


To see the impact of training examples and the variance across different validation 
runs, learning curves are plotted. They provide useful comparison of the training 
and validation metrics as a function of the training set size. We plot learning curves 
for the tuned logistic regression and SVM, as they were high-scoring algorithms 
(as shown in Fig. 2.23a and b). It can be observed that the SVM validation score 
increases monotonically with the size of the training set, demonstrating that more 
examples do improve the performance. It can also be observed that SVM has low 
variance across runs compared to logistic regression, which indicates robustness of 
the SVM classifier. 
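Learning curves of this kind can be computed with scikit-learn's `learning_curve` helper. The sketch uses synthetic data with a circular class boundary and reuses the SVM hyperparameters listed in Table 2.3; the training-set fractions and the five folds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                                # synthetic features
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5).astype(int)          # non-linear boundary

train_sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf", gamma=0.01, C=100),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),   # fractions of the available training folds
    cv=5,
    scoring="accuracy",
)

val_mean = val_scores.mean(axis=1)   # average validation score per training size
val_std = val_scores.std(axis=1)     # spread across folds, i.e., variance across runs
```

Plotting `val_mean` against `train_sizes`, with a band of width `val_std`, gives the validation curve; the same can be done with `train_scores` for the training curve.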


2.9.4 Final Training and Testing Models 


Finally, we train the best models (with the best parameters) on the entire training 
data and run them on the test data for estimating the out-of-sample error (Table 2.4). 
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Fig. 2.23: (a) Learning curves for tuned logistic regression. (b) Learning curves for 
tuned SVM 
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Table 2.4: Hyperparameter and validation scores for classifiers 


Classifier          | Accuracy | Precision | Recall | F1-score
Perceptron          | 0.55     | 0.55      | 0.56   | 0.56
Logistic regression | 0.61     | 0.61      | 0.62   | 0.61
LDA                 | 0.61     | 0.61      | 0.61   | 0.61
Naive Bayes         | 0.60     | 0.61      | 0.60   | 0.60
SVM (RBF)           | 0.64     | 0.64      | 0.65   | 0.65
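The four metrics in the table can be computed with scikit-learn once a model has been refit on the full training set. The sketch below uses synthetic training and test sets and the SVM hyperparameters from Table 2.3, so its scores are unrelated to the values reported above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(300, 4)), rng.normal(size=(150, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)   # synthetic labels
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

# Refit the chosen configuration on all training data, then score once on the test set
model = SVC(kernel="rbf", gamma=0.01, C=100).fit(X_train, y_train)
y_pred = model.predict(X_test)

report = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
```

The test set is touched only in this final step; all model and hyperparameter choices were made earlier using cross-validation on the training data.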


2.9.5 Exercises for Readers and Practitioners 


Some other interesting problems readers and practitioners can attempt on their own 
include: 


1. What is the impact of other feature transformations, such as normalization?

2. What is the impact of other univariate feature selection methods, such as mutual
   information (selecting high-gain features)?

3. What is the impact of multivariate feature selection, such as correlation
   feature selection (CFS) or minimum redundancy maximum relevance (mRMR), which
   consider groups of features as opposed to individual features?

4. What is the impact of wrapper-based feature selection methods like recursive
   feature elimination (RFE)?

5. What is the impact of other non-linear learning methods, such as decision tree,
   gradient boosting, and random forest?

6. What is the impact of meta-learning techniques, such as cost-based learning,
   ensemble learning, and others?
Chapter 3
Text and Speech Basics


3.1 Introduction 


This chapter introduces the major topics in text and speech analytics and machine 
learning approaches. Neural network approaches are deferred to later chapters. 

We start with an overview of natural language and computational linguistics. 
Representations of text that will form the basis of advanced analysis are introduced, 
and the core components in computational linguistics are discussed. Readers are 
guided through the broad range of applications that leverage these concepts. We 
investigate the topics of text classification and text clustering, and move on to
applications in machine translation, question answering, and automated summarization.
In the latter part of this chapter, acoustic models and audio representations are in- 
troduced, including MFCCs and spectrograms. 


3.1.1 Computational Linguistics 


Computational linguistics focuses on applying quantitative and statistical methods 
to understand how humans model language, as well as computational approaches 
to answer linguistic questions. Its beginning in the 1950s coincided with the advent 
of computers. Natural language processing (NLP) is the application of computa- 
tional methods to model and extract information from human language. While the 
difference between the two concepts relates to underlying motivation, they are often 
used interchangeably. 

Computational linguistics can refer to written or spoken natural language. A writ- 
ten language is the symbol representation of a spoken or gestural language. There 
are plenty of spoken natural languages without a writing system, whereas there are 
no written natural languages that have developed on their own without a spoken as- 
pect. Natural language processing of written language is often called text analytics 
and of spoken language is called speech analytics. 
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Computational linguistics was considered in the past as a field within computer 
science. This has evolved considerably as computational linguistics became an in- 
terdisciplinary field of theoretical and applied science joining linguistics, psychol- 
ogy, neuroscience, philosophy, computer science, mathematics, and others. With the 
rise of social media, conversational agents, and personal assistants, computational 
linguistics is increasingly relevant in creating practical solutions to modeling and 
understanding human language. 


3.1.2 Natural Language 


A natural language is one that has evolved naturally through daily use by humans
over time, without formal construction. Natural languages encompass a broad set that includes
spoken and signed languages. By some estimates, there are about 7000 human
languages currently in existence, with the top ten representing 46% of the world's
population [And12] (Fig. 3.1).
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Fig. 3.1: Top 10 world languages 


Natural language is inherently ambiguous, especially in its written form. To un- 
derstand why this is so, consider that the English language has about 170,000 words 
in its vocabulary but only about 10,000 are commonly used day-to-day [And12]. 
Human communications have evolved to be highly efficient, allowing for reuse of 
shorter words whose meanings are resolved through context. This lessens the com- 
putational burden and frees up parts of the human brain for other important tasks. 
At the same time, this ambiguity makes it inherently hard for computers to pro- 
cess and understand natural language. This difficulty extends to aspects of language 
such as sarcasm, irony, metaphors, and humor. In any language, ambiguities exist 
in word sense, grammatical structure, and sentence structure. We will discuss below 
methods that deal with each of these ambiguities. 
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3.1.3 Model of Language 


When we analyze a natural language, we often group language characteristics into a 
set of categories. For text analysis, these categories are morphology, lexical, syntax, 
semantics, discourse, and pragmatics. Morphology refers to the shape and internal 
structure of a word. Lexical refers to the segmentation of text into meaningful units 
like words. Syntax refers to the rules and principles applied to words, phrases, and 
sentences. Semantics refers to the context that provides meaning within a sentence. 
It is semantics that provides the efficiency of human language. Discourse refers to 
conversations and the relationships that exist among sentences. Pragmatics refers 
to external characteristics such as the intent of the speaker to convey context. For 
speech analysis, we typically group language characteristics into the categories of 
acoustics, phonetics, phonemics, and prosodics. Acoustics refers to the methods we 
use to represent sounds. Phonetics refers to how sounds are mapped to phonemes 
that serve as base units of speech. Phonemics, also known as phonology, refers to 
how phonemes are used in a language. Prosodics refers to non-language characteris- 
tics that accompany speech such as tone, stress, intonation, and pitch. In the follow- 
ing sections, we will discuss each in greater detail from a computational linguistics 
perspective. As we will see in the subsequent chapters, these linguistic characteris- 
tics can be used to provide a rich set of representations that are useful for machine 
learning algorithms (Table 3.1). 


Table 3.1: Language analysis categories 


Morphology | Shape and structure of words
Lexical    | Segmenting text into words
Syntax     | Rules for words in a sentence
Semantics  | Meaning of words in a sentence
Discourse  | Meaning among sentences
Pragmatics | Meaning through speaker intent
Acoustics  | Representations of sound
Phonetics  | Mapping sound to speech
Phonemics  | Mapping speech to language
Prosodics  | Stress, pitch, tone, rhythm


Because of the dependencies inherent in these categories, we often model a nat- 
ural language as a hierarchical collection of linguistic characteristics as in Fig. 3.2. 

This is often called a synchronic model of language—that is, a model that is 
based on a snapshot in time of a language [Rac14]. Some linguists have argued that 
such a synchronic model does not fit modern living languages that constantly evolve, 
preferring diachronic models that can address changes in time. The complexity of 
diachronic models makes them difficult to handle, however, and synchronic models 
as originally championed by Swiss linguist Ferdinand de Saussure at the turn of 
the twentieth century are widely adopted today [Sau16]. The computational and
statistical methods that can be applied to each component in the synchronic model
form the basis of natural language processing.


Fig. 3.2: Model of natural language

Natural language processing seeks to map language to representations that cap- 
ture morphological, lexical, syntactic, semantic, or discourse characteristics that can 
then be processed by machine learning methods. The choice of representation can 
have significant impact on later tasks and will depend on the selected machine learn- 
ing algorithm for analysis. 

In the following sections of this chapter, we will dive into these representations 
to better understand their role in linguistics and purpose in natural language pro- 
cessing. We introduce to readers the most common text-based representations, and 
leave audio representations to later sections. 


3.2 Morphological Analysis 


All natural languages have systematic structure, even sign language. In linguistics, 
morphology is the study of the internal structure of words. Literally translated from 
its Greek roots, morphology means “the study of shape.” It refers to the set of rules
and conventions used to form words based on their context, such as plurality, gender,
contraction, and conjugation.

Words are composed of subcomponents called morphemes, which represent the
smallest unit of language that holds independent meaning. Morphemes may be com-
ponents of the word that relate to its meaning, grammatical role, or derivation. Some
morphemes are words by themselves, such as “run,” “jump,” or “hide.” Other words
are formed by a combination of morphemes, such as “runner,” “jumps,” or “unhide.”
Some languages, like English, have relatively simple morphologic rules for combin-
ing morphemes. Others, like Arabic, have a rich set of complex morphologic rules
[HEH12].

To humans, understanding the morphological relations between the words “walk, 
walking, walked” is relatively simple. The plurality of possible morpheme combina- 
tions, however, makes it very difficult for computers to do so without morphological 
analysis. Two of the most common approaches are stemming and lemmatization, 
which we describe below. 


3.2.1 Stemming 


Often, the word ending is not as important as the root word itself. This is especially 
true of verbs, where the verb root may hold significantly more meaning than the 
verb tense. If this is the case, computational linguistics applies the process of word
stemming to convert words to their root form (i.e., the base morpheme that carries
the meaning). Here are some stemming examples:


works → work
worked → work
workers → work


While you and I can easily recognize that each of these is related in meaning, it
would be very difficult for a computer to do so without stemming. It is important to
note that stemming can introduce ambiguity, as is evident in the third example above,
where “workers” has the same stem as “works,” even though the two words have
different meanings (people versus items). On the other hand, the advantage of
stemming is that it is generally robust to spelling errors, as the correct root may
still be inferred.


One of the most popular stemming algorithms in NLP is the Porter stem- 
mer, devised by Martin Porter in 1980. This simple and efficient method uses 
a series of 5 steps to strip word suffixes and find word stems. Open-source 
implementations of the Porter stemmer are widely available. 
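
To make the idea of suffix stripping concrete, the following is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the suffix list, the minimum stem length, and the function name toy_stem are all assumptions chosen only to reproduce the three examples above.

```python
# Toy suffix-stripping stemmer. This is NOT the Porter stemmer; the suffix
# list and the minimum stem length are illustrative assumptions.
SUFFIXES = ("ers", "er", "ing", "ed", "es", "s")  # longer suffixes are tried first


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ("works", "worked", "workers"):
    print(w, "->", toy_stem(w))
```

Because the rules look only at word endings, "workers" and "works" collapse to the same stem, which is exactly the ambiguity discussed above.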


3.2.2 Lemmatization


Lemmatization is another popular method used in computational linguistics to re- 
duce words to base forms. It is closely related to stemming in that it is an algorithmic 
process that removes inflection and suffixes to convert words into their lemma (i.e.,
dictionary form). Some examples of lemmatization are:


works → work
worked → work
workers → worker


Notice that the lemmatization results are very similar to those of stemming, except
that the results are actual words. Whereas stemming is a process where meaning
and context can be lost, lemmatization does a much better job, as is evident in the
third example above. Since lemmatization requires a dictionary (lexicon) and numerous
lookups, stemming is faster and is generally the preferred method. Lemmatization
is also extremely sensitive to spelling errors, and may require spell correction
as a preprocessing step. 
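
The dictionary dependence of lemmatization can be sketched as a simple lookup. The table LEMMA_DICT below is a hypothetical, hand-built stand-in for a real lexicon, and the function name lemmatize is our own; a production lemmatizer would also use part-of-speech information.

```python
# Dictionary-lookup lemmatizer sketch. LEMMA_DICT is a tiny hypothetical
# lexicon used only for illustration.
LEMMA_DICT = {
    "worked": "work",
    "workers": "worker",
    "ran": "run",
    "mice": "mouse",
}


def lemmatize(word: str) -> str:
    """Return the dictionary form if known; otherwise return the word unchanged."""
    key = word.lower()
    return LEMMA_DICT.get(key, key)


print(lemmatize("workers"))  # found in the lexicon
print(lemmatize("workrs"))   # misspelling: no entry, so it passes through unchanged
```

The second call illustrates the sensitivity to spelling errors: a misspelled word has no entry in the lexicon, so no lemma can be returned for it.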


3.3 Lexical Representations 


Lexical analysis is the task of segmenting text into its lexical expressions. In natu- 
ral language processing, this means converting text into base word representations 
which can be used for further processing. In the next few subsections, we provide an 
overview of word-level, sentence-level, and document-level representations. As the 
reader will see, these representations are inherently sparse, in that few elements are 
non-zero. We leave dense representations and word embeddings to a later chapter. 

Words are the elementary symbols of natural language. They are not the most 
elementary, as words can consist of one or more morphemes. In natural language 
processing, often the first task is to segment text into separate words. Note that we 
say “often” and not “always.” As we will see later, sentence segmentation as a first 
step may provide some benefits, especially in the presence of “noisy” or ill-formed 
speech. 


3.3.1 Tokens 


The computational task of segmenting text into relevant words or units of meaning 
is called tokenization. Tokens may be words, numbers, or punctuation marks. In 
simplest form, tokenization can be achieved by splitting text using whitespace: 


The rain in Spain falls mainly on the plain. 


|The|, |rain|, |in|, |Spain|, |falls|, |mainly|, |on|, |the|, |plain|, |.|




This works in most cases, but fails in others: 


Don’t assume we’re going to New York.
|Don|, |’|, |t|, |assume|, |we|, |’|, |re|, |going|, |to|, |New|, |York|, |.|


Notice that “New York” is typically considered a single token, since it refers to a 
specific location. To compound problems, tokens can sometimes consist of multiple 
words (e.g., “he who cannot be named”). There are also numerous languages that
do not use any whitespace, such as Chinese. 
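
The difference between whitespace splitting and a punctuation-aware tokenizer can be sketched with the standard re module. The regular expression, the MULTIWORD list, and the function names below are assumptions for illustration, not a standard tokenizer.

```python
import re

# One token per run of word characters, or per single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

# Hypothetical list of multiword expressions to be kept as single tokens.
MULTIWORD = {("New", "York")}


def whitespace_tokenize(text):
    return text.split()


def regex_tokenize(text):
    return TOKEN_PATTERN.findall(text)


def merge_multiword(tokens):
    """Join adjacent token pairs that appear in MULTIWORD into one token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in MULTIWORD:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentence = "Don't assume we're going to New York."
print(whitespace_tokenize(sentence))
print(regex_tokenize(sentence))
print(merge_multiword(regex_tokenize(sentence)))
```

Neither function helps with languages written without whitespace; those require dictionary-based or learned segmentation.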




Tokenization serves also to segment sentences by delineating the end of one sen- 
tence and beginning of another. Punctuation plays an important role in this task, but 
unfortunately punctuation is often ambiguous. Punctuation like apostrophes, hy- 
phens, and periods can create problems. Consider the multiple use of the period in 
this sentence:


Dr. Graham poured 0.5ml into the beaker.
|Dr.|, |Graham poured 0.|, |5ml into the beaker.|


A simple punctuation-based sentence splitting algorithm would incorrectly seg-
ment this into three sentences. There are numerous methods to overcome this am-
biguity, including augmenting punctuation-based splitting with hand-engineered
rules, using regular expressions, machine learning classification, conditional random fields, and
slot-filling approaches. 
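
As a minimal sketch of the first remedy, the code below contrasts a naive period-based splitter with one augmented by a hand-engineered rule. The ABBREVIATIONS set and both function names are assumptions; real systems use far larger rule sets or learned models.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e."}


def naive_split(text):
    """Split after every period, regardless of context."""
    return [s.strip() for s in re.findall(r"[^.]*\.", text)]


def rule_split(text):
    """End a sentence only at a token that ends in . ! or ? and is not a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "Dr. Graham poured 0.5ml into the beaker."
print(naive_split(text))  # three fragments
print(rule_split(text))   # one sentence
```

Working on whitespace-delimited tokens means the decimal point in "0.5ml" is never considered as a boundary, while the abbreviation rule handles "Dr."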


3.3.2 Stop Words 


Tokens do not occur uniformly in English text. Instead, their frequencies follow a
power-law pattern known as Zipf’s law, under which a small subset of
tokens occur very often (e.g., the, of, as) while most occur rarely. How rarely? Of
the 884,000 tokens in Shakespeare’s complete works, the 100 most frequent distinct
tokens account for over half of them [Gui+06].

In the written English language, common functional words like “the,” “a,” or 
“is” provide little to no context, yet are often the most frequently occurring words 
in text as seen in Fig. 3.3. By excluding these words in natural language processing, 
performance can be significantly improved. The list of these commonly excluded 
words is known as a stop word list. 
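
A stop word filter amounts to a set-membership test. In the sketch below, STOP_WORDS is a small hypothetical list chosen for the example sentence; practical lists contain a few hundred entries and vary by application.

```python
from collections import Counter

# Small hypothetical stop word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "on", "as", "and", "to"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "the rain in spain falls mainly on the plain".split()
print(Counter(tokens).most_common(1))  # the most frequent token is a function word
print(remove_stop_words(tokens))
```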


3.3.3 N-Grams 


Fig. 3.3: Zipf’s law as it applies to the text from the complete works of Shakespeare


While word level representations are sometimes useful, they do not capture the re-
lationship with adjacent words that can help provide grammar or context. For in-
stance, when working with individual tokens, there is no concept of word order.
One simply considers the existence and occurrence of tokens within a piece of text.
This is known as a bag-of-words model (see Eq. (3.1)), and is based on the assumption
that each token occurs independently of the others (a zeroth-order Markov
assumption). A phrase containing L tokens would be predicted with probability:


$$P(w_1 w_2 \ldots w_L) = \prod_{i=1}^{L} P(w_i) \tag{3.1}$$


Instead of considering individual tokens (termed unigrams), another approach 
would be to consider consecutive tokens. This is called a bigram approach, where a 
sentence with L tokens would yield L − 1 bigrams in the form:


$$P(w_1 w_2 \ldots w_L) = \prod_{i=2}^{L} P(w_i \mid w_{i-1}) \tag{3.2}$$


Notice that bigrams effectively capture the local word order of two consecutive 
tokens (e.g., “lion king” is not the same as “king lion”). We can extend this concept
to capture lengths of n tokens, known as n-grams: 


$$P(w_1 w_2 \ldots w_L) = \prod_{i=n}^{L} P(w_i \mid w_{i-1} w_{i-2} \ldots w_{i-n+1}) \tag{3.3}$$


It is important to note that for higher values of n, n-grams become extremely infre- 
quent. This will adversely impact count-based natural language processing methods 
as we see later. 
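
The sketch below extracts n-grams from a token list and estimates a bigram probability from raw counts (a maximum-likelihood estimate without smoothing). The function names and the toy sentence are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_probability(tokens, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev as a predecessor)."""
    predecessor_counts = Counter(tokens[:-1])  # the last token has no successor
    bigram_counts = Counter(ngrams(tokens, 2))
    if predecessor_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / predecessor_counts[prev]


tokens = "the lion king and the lion cub".split()
print(ngrams(tokens, 2))  # L - 1 bigrams for L tokens
print(bigram_probability(tokens, "lion", "king"))
```

As n grows, most n-grams are seen once or never, so count-based estimates of this kind become unreliable without smoothing.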


3.3.4 Documents 


When working with a corpus of documents, there are a number of document repre- 
sentation methods used in computational linguistics. Some are token based, multi- 
token based, or character based. Many are sparse representations, while others are 
dense. We discuss two of the most common ones in this section. 


3.3.4.1 Document-Term Matrix


A document-term matrix is a mathematical representation of a set of documents. 
In the document-term matrix shown in Table 3.2, rows correspond to documents in 
the collection and columns correspond to unique tokens. The number of columns 
is equal to the unique token vocabulary across all documents. There are numerous 
ways to determine the value of the elements of this matrix, and we discuss two 
below. 


3.3.4.2 Bag-of-Words 


One common approach is to set each element of the document-term matrix equal 
to the frequency of word occurrence within each document. Imagine representing 
each document as a list of counts of unique words. This is known as a bag-of-words 
approach [PT13]. Obviously, there is significant information loss by simply using 
a document vector to represent an entire document, but this is sufficient for many 
computational linguistics applications. This process of converting a set of docu- 
ments into a document-term matrix where each element is equal to word occurrence 
is commonly known as count vectorization. 
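
Count vectorization can be sketched in a few lines of plain Python. The function name count_vectorize, the lowercase whitespace tokenization, and the alphabetical column order are assumptions made for this example.

```python
def count_vectorize(documents):
    """Build a document-term matrix of raw token counts.

    Rows follow the order of `documents`; columns follow the sorted vocabulary.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    column = {token: j for j, token in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in tokenized]
    for i, doc in enumerate(tokenized):
        for token in doc:
            matrix[i][column[token]] += 1
    return vocabulary, matrix


vocabulary, matrix = count_vectorize(["the cat sat", "the cat saw the dog"])
print(vocabulary)
for row in matrix:
    print(row)
```

Most entries are zero once the vocabulary grows, which is why such matrices are normally stored in a sparse format.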


Table 3.2: Document-term matrix 





You may remember that the most frequent words in the English vocabulary are 
generally less significant than rarer words for discerning meaning. Unfortunately, 
a bag-of-words model weighs words based on occurrence. In practice, a stop word 
filter is used to remove the most common words prior to count vectorization. Even 
in doing so however, we will still find that rare words occurring across a set of 
documents are often the most meaningful. 


3.3.4.3 TFIDF


TFIDF is a method that provides a way to give rarer words greater weight by setting 
each document-term matrix element equal to the value w of multiplying the term 
frequency (TF) by the inverse document frequency (IDF) of each token [Ram99]: 


$$w = tf \times idf \tag{3.4}$$

$$\phantom{w} = (1 + \log(TF_t)) \times \log\left(\frac{N}{n_t}\right) \tag{3.5}$$


Here, we have used the definitions of term frequency as the logarithmically scaled
ratio of the count of term t occurring in document d versus the total number of terms
in document d, and inverse document frequency as the logarithmically scaled ratio
of the total number of documents N vs the count of documents n_t with term t.

Because of the tf factor, the TFIDF value for a token increases proportionally 
to the number of times it appears in a document. The idf factor reduces the TFIDF 
value for a token based on the frequency of occurrence across all documents. Cur- 
rently, TFIDF is the most popular weighting method, with over 80% of current dig- 
ital libraries using it in production (Table 3.3). 
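
The weighting can be sketched directly from the two factors. The code below is one possible reading, with two stated assumptions: the term frequency is taken as the raw count of the term in the document (a common variant, which keeps the weight non-negative), and natural logarithms are used. The function name tfidf and the toy documents are ours.

```python
import math


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Assumptions: tf is the raw count of the term in the document, the
    weight is (1 + log(tf)) * log(N / n_t), and logs are natural logs.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = {}
    for doc in tokenized:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for doc in tokenized:
        row = {}
        for term in set(doc):
            tf = doc.count(term)
            row[term] = (1 + math.log(tf)) * math.log(n_docs / doc_freq[term])
        weights.append(row)
    return weights


docs = ["the cat sat", "the cat saw the dog", "the bird flew"]
for row in tfidf(docs):
    print(sorted(row.items()))
```

A term such as "the" that occurs in every document receives weight zero, while a term confined to a single document receives the largest idf factor.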


Table 3.3: TFIDF matrix 


3.4 Syntactic Representations


Syntactic representations of natural language deal with the grammatical structure 
and relation of words and phrases within sentences. Grammar plays an inherent role 
in most languages to help provide context. Computational linguistics utilizes differ- 
ent approaches to extract these context clues such as part-of-speech tags, chunking, 
and dependency parsing. They serve well as features for downstream natural lan- 
guage processing tasks. 


3.4.1 Part-of-Speech


A part-of-speech (POS) is a class of words with grammatical properties that play 
similar sentence syntax roles. It is widely accepted that there are 9 basic part-of- 
speech classes (you may remember them from grade school) (Table 3.4). 


Table 3.4: Basic part-of-speech labels 
N   | Dog, cat
ADJ | Green, short
ADV | Quickly, likely
P   | By, for
CON | And, but
PRO | You, me
INT | Wow, lol





There are numerous POS subclasses in English, such as singular nouns (NN), 
plural nouns (NNS), proper nouns (NP), or adverbial nouns (NR). Some languages 
can have over 1000 parts of speech [PDM11]. Due to the ambiguity of the English 
language, many English words belong to more than one part-of-speech category 
(e.g., “bank” can be a verb, noun, or interjection), and their role depends on how 
they are used within a sentence. It can be difficult to identify the category to which
a word belongs. Part-of-speech tagging is the process of predicting the part-of-speech
category for each word in the text based on its grammatical role and context within
a sentence (Fig. 3.4) [DeR88]. POS tagging algorithms fall into two distinct groups:
rules-based and statistical methods.


PRO  V      N    N       P   A    N    N
She  sells  sea  shells  by  the  sea  shore.


PRO  V     N     CON     P   N
We   like  eels  except  as  meals.


Fig. 3.4: POS tagging


The Brown corpus was the first major collection of English texts used in 
computational linguistics research. It was developed by Henry Kučera and
W. Nelson Francis at Brown University in the mid-1960s and consists of
over a million words of English prose extracted from 500 randomly chosen
publications of 2000 or more words. Each word in the corpus has been POS-
tagged meticulously using 87 distinct POS tags. The Brown corpus is still 
commonly used as a gold set to measure the performance of POS-tagging 
algorithms. 


3.4.1.1 Rules Based 


The earliest part-of-speech tagging approaches were rules based and depended on 
dictionaries, lexicons, or regular expressions to predict possible POS labels for each 
word. Where ambiguities arose, ad-hoc rules were often incorporated to make POS 
tag decisions. This made rules-based systems brittle. For instance, a rule could de- 
clare that a word that follows an adverb and comes before a conjunction should 
be a noun, except it should be a verb if it is not a singular common noun. The best
rules-based POS tagger to date achieved only 77% accuracy on the Brown corpus 
[BM04]. 
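
A rules-based tagger of the kind described above can be sketched with a small lexicon and a few suffix patterns. Everything here is an assumption for illustration: the LEXICON entries, the SUFFIX_RULES, the default tag N, and the function name rule_tag. The tag names follow Fig. 3.4.

```python
import re

# Hypothetical closed-class lexicon and suffix rules; tags follow Fig. 3.4.
LEXICON = {"the": "A", "a": "A", "she": "PRO", "we": "PRO",
           "by": "P", "as": "P", "and": "CON"}
SUFFIX_RULES = [
    (re.compile(r".*ly$"), "ADV"),
    (re.compile(r".*(ing|ed)$"), "V"),
    (re.compile(r".*(ous|ful|able)$"), "ADJ"),
]


def rule_tag(tokens):
    """Tag each token by lexicon lookup, then suffix rules, defaulting to noun."""
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            tag = LEXICON[word]
        else:
            tag = next((t for pattern, t in SUFFIX_RULES if pattern.match(word)), "N")
        tagged.append((token, tag))
    return tagged


print(rule_tag("She quickly walked by the lake".split()))
print(rule_tag("She sells shells".split()))  # "sells" falls through to the noun default
```

The second call shows the brittleness of such rules: no suffix pattern covers the verb "sells," so it is mislabeled as a noun.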


3.4.1.2 Hidden Markov Models 


Since the 1980s, hidden Markov models (HMMs) introduced in the previous chap- 
ter became popular as a better approach to POS tagging. HMMs are better able to 
learn and capture the sequential nature of grammar than rules-based methods. To 
understand this, consider the POS tagging problem, where we seek to find the most
probable tag sequence $\hat{t}^n$ for a given sequence of n words $w^n$:


$$\hat{t}^n = \operatorname*{argmax}_{t^n} P(t^n \mid w^n) \tag{3.6}$$

$$\phantom{\hat{t}^n} \approx \operatorname*{argmax}_{t^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) \tag{3.7}$$


The equation above represents an HMM model where the observed states are the
words $w^n$ and the hidden states $t^n$ are the POS tags. The transition matrices can be
directly computed from text data. It turns out that assigning the most common tag 
to each known word can achieve fairly high accuracy. To account for more ambigu- 
ous word sequences, higher-order HMMs can also be used for larger sequences by 
leveraging the Viterbi algorithm. These higher-order HMMs can achieve very high 
accuracy, but they require significant computation load since they must explore a 
larger set of paths. 
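
The search for the most probable tag sequence in a first-order HMM can be sketched with the Viterbi algorithm. The three-tag model below is a toy: the tag set and all start, transition, and emission probabilities are hypothetical values chosen by hand, not estimates from a corpus, and the function assumes a non-empty word list.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[i][t] = (probability of the best path ending in tag t at position i,
    #               previous tag on that path)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p][t] * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Hypothetical toy model.
TAGS = ["PRO", "V", "N"]
START = {"PRO": 0.6, "V": 0.1, "N": 0.3}
TRANS = {
    "PRO": {"PRO": 0.05, "V": 0.8, "N": 0.15},
    "V": {"PRO": 0.2, "V": 0.1, "N": 0.7},
    "N": {"PRO": 0.1, "V": 0.5, "N": 0.4},
}
EMIT = {
    "PRO": {"she": 0.9},
    "V": {"sells": 0.6, "shells": 0.2},
    "N": {"she": 0.01, "sells": 0.1, "shells": 0.7},
}

print(viterbi(["she", "sells", "shells"], TAGS, START, TRANS, EMIT))
```

Each position requires comparing every pair of tags, so the cost grows with the square of the tag set size; higher-order models enlarge the state space and the cost accordingly.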

Beyond HMMs, machine learning methods have gained huge popularity for POS 
tagging tasks, including CRF, SVM, perceptrons, and maximum entropy classifica- 
tion approaches. Most now achieve accuracy above 97%. In the subsequent chapters, 
we will examine deep learning approaches that hold even greater promise for POS
tag prediction.


3.4.2 Dependency Parsing


In a natural language, grammar is the set of structural rules by which words and 
phrases are composed. Every sentence in English follows a certain pattern. These 
patterns are called grammars and express a relation between a (head) word and its 
dependents. Most natural languages have a rich set of grammar rules, and knowl- 
edge of these rules helps us disambiguate context in a sentence. Consider the fact 
that without grammar, there would be practically unlimited possibilities to combine 
words together. 

Parsing is the natural language processing task of identifying the syntactic rela-
tionship of words within a sentence, given the grammar rules of a language [Cov01].
There are two common ways to describe sentence structure in natural language. The
first is to represent the sentence by its constituent phrases, recursively down to the
individual word level. This is known as constituent grammar parsing, which maps
a sentence to a constituent parse tree (Fig. 3.5). The other way is to link individual
words together based on their dependency relationship. This is known as depen-
dency grammar parsing, which maps a sentence to a dependency parse tree (Fig.
3.6). Dependency is a one-to-one correspondence, which means that there is exactly
one node for every word in the sentence. Notice that the links are directional be-
tween two words in a dependency parse tree, pointing from the head word to the
dependent word to convey the relationship. Constituent and dependency parse trees
can be strongly equivalent. The appeal of a dependency tree is that the links closely
resemble semantic relationships.


Black swans like clear ponds

Fig. 3.5: Constituent grammar parsing


ADJ    N      V     ADJ    N
Black  swans  like  clear  ponds

Fig. 3.6: Dependency grammar parsing

Because a dependency tree contains one node per word, the parsing can be 
achieved with computational efficiency. Given a sentence, parsing algorithms at- 
tempt to find the most likely derivation from its grammatical rules. If the sentence 
exhibits structural ambiguity, more than one derivation is possible. Parsers are sub- 
divided into two general approaches. Top-down parsers use a recursive algorithm 
with a back-tracking mechanism to descend from the root down to all words in the 
sentence. Bottom-up parsers start with the words and build up the parse tree based 
on a shift/reduce or other algorithm. Top-down parsers will derive trees that will 
always be grammatically consistent, but may not align with all words in a sentence. 
Bottom-up approaches will align all words, but may not always make grammati-
cal sense.


3.4.2.1 Context-Free Grammars 


Grammar, as stated above, is the set of rules that define the syntactic structure and 
pattern of words in a sentence. Because these rules are generally fixed and absolute, 
a context-free grammar (CFG) can be used to represent the grammatical rules of a 
language [JM09]. Context-free grammars typically have a representation known as 
Backus–Naur form and are able to capture both constituency and ordering of words
in a sentence. 
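
A context-free grammar and a top-down parser for it can be sketched as follows. The grammar is a hypothetical toy that covers only the example sentence “Black swans like clear ponds”; the dictionary encoding and the function names are our own. The parser is a recursive-descent parser with backtracking, which is safe here because the toy grammar has no left-recursive rules.

```python
# Toy grammar: each nonterminal maps to a list of productions.
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["ADJ", "N"], ["N"]],
    "VP": [["V", "NP"]],
    "ADJ": [["black"], ["clear"]],
    "N": [["swans"], ["ponds"]],
    "V": [["like"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` derives tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:  # terminal: must match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(production, tokens, pos, [symbol])


def expand(production, tokens, pos, children):
    """Match the symbols of one production left to right, backtracking on failure."""
    if not production:
        yield tuple(children), pos
        return
    for subtree, next_pos in parse(production[0], tokens, pos):
        yield from expand(production[1:], tokens, next_pos, children + [subtree])


def parse_sentence(sentence):
    """Return all parse trees that cover the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


for tree in parse_sentence("Black swans like clear ponds"):
    print(tree)
```

Returning a list of trees rather than a single tree leaves room for structural ambiguity: an ambiguous grammar would simply yield more than one derivation.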

Unfortunately, because of the inherent ambiguity of language, CFG may generate 
multiple possible parse derivations for a given sentence. Probabilistic context-free 
grammars (PCFG) deal with this issue by ranking possible parse derivations and 
selecting the most probable, given a set of weights learned from a distribution of 
text. PCFGs generally outperform CFGs, especially for languages like English. 
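
Ranking derivations under a PCFG reduces to multiplying rule probabilities. The sketch below scores two hand-written derivations of the classic ambiguous sentence “I saw the man with the telescope.” The rule weights in RULE_PROB are hypothetical rather than learned from text, and lexical rules are omitted because they are identical in both derivations.

```python
import math

# Hypothetical rule probabilities; the rules for each left-hand side sum to 1.
RULE_PROB = {
    ("S", "NP VP"): 1.0,
    ("VP", "V NP"): 0.6,
    ("VP", "V NP PP"): 0.4,
    ("NP", "NP PP"): 0.25,
    ("NP", "DET N"): 0.5,
    ("NP", "PRO"): 0.25,
    ("PP", "P NP"): 1.0,
}


def derivation_prob(rules):
    """Probability of a derivation: the product of its rule probabilities."""
    return math.prod(RULE_PROB[rule] for rule in rules)


# The prepositional phrase attaches to the verb (seeing with a telescope).
verb_attach = [("S", "NP VP"), ("NP", "PRO"), ("VP", "V NP PP"),
               ("NP", "DET N"), ("PP", "P NP"), ("NP", "DET N")]
# The prepositional phrase attaches to the noun (the man who has a telescope).
noun_attach = [("S", "NP VP"), ("NP", "PRO"), ("VP", "V NP"), ("NP", "NP PP"),
               ("NP", "DET N"), ("PP", "P NP"), ("NP", "DET N")]

best = max([verb_attach, noun_attach], key=derivation_prob)
print(derivation_prob(verb_attach), derivation_prob(noun_attach))
```

With these particular weights the verb attachment is preferred; different learned weights could reverse the ranking.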


3.4.2.2 Chunking 


For some applications, a full syntactic parse with its computational expense may 
not be needed. Chunking, also called shallow parsing, is a natural language pro- 
cessing task which joins words into base syntactic units rather than generating a 
full parse tree. These base syntactic units are often referred to as “chunks.” For ex-
ample, given a sentence, we would like to identify just the base noun-phrases (i.e.,
phrases that serve the same grammatical function as a noun and do not contain other
noun-phrases):


[NP The winter season] is depressing for [NP many people]


Chunking is often performed by a rules-based approach where regular expressions
and POS-tags are used to match fixed patterns, or with machine learning algorithms
such as SVM [KM01].
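
The rules-based approach can be sketched by running a regular expression over the sequence of POS tags. The pattern (an optional article, any number of adjectives, one or more nouns), the single-letter tag encoding in TAG_CODE, and the hand-assigned tags are all assumptions for illustration.

```python
import re

# Encode each tag as one character so that regex match offsets are token indices.
TAG_CODE = {"A": "a", "ADJ": "j", "N": "n"}  # every other tag becomes "x"
NP_PATTERN = re.compile(r"a?j*n+")           # optional article, adjectives, nouns


def np_chunks(tagged):
    """Return base noun phrases from a list of (word, tag) pairs."""
    codes = "".join(TAG_CODE.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(codes)
    ]


# Tags assigned by hand for this example.
tagged = [("The", "A"), ("winter", "N"), ("season", "N"), ("is", "V"),
          ("depressing", "ADJ"), ("for", "P"), ("many", "ADJ"), ("people", "N")]
print(np_chunks(tagged))
```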


3.4.2.3 Treebanks 


A treebank is a text corpus that has been parsed and annotated for syntactic structure. 
That is, each sentence in the corpus has been parsed into its dependency parse tree. 
Treebanks are typically generated iteratively using a parser algorithm and human 
review [Mar+94] [Niv+16]. Often, treebanks are built on top of a corpus that has 
already been annotated with part-of-speech tags. The creation of treebanks revolu- 
tionized computational linguistics, as it embodied a data-driven approach to gener- 
ating grammars that could be reused broadly in multiple applications and domains. 
Statistical parsers trained with treebanks are able to deal much better with structural 
ambiguities [Bel+17]. 


The Penn Treebank is the de facto standard treebank for parse analysis and 
evaluation. Initially released in 1992, it consists of a collection of articles 
from Dow Jones News Service written in English, of which 1 million words 
are POS-tagged and 1.6 million words parsed with a tagset. An improved 
version of the Penn Treebank was released in 1995. 


Universal Dependencies is a collection of over 100 treebanks in 60 
languages, created with the goal of facilitating cross-lingual analysis 
[McD+13]. As its name implies, the treebanks are created with a set of 
universal, cross-linguistically consistent grammatical annotations. The first 
version was released in October of 2014. 


3.5 Semantic Representations 


Whereas lexical and syntactic analyses capture the form and order of language, they 
do not associate meaning with words or phrases. For instance, labeling “dog” as 
a noun gives us no clue what a “dog” is. Semantic representations give a sense of 
meaning to words and phrases. They attach roles to word chunks such as person, 
place, or amount. Semantic analysis is interested in understanding meaning primar- 
ily in terms of word and sentence relationships. There are several different kinds of 
semantic relations between words (see Table 3.5).


Table 3.5: Semantic relations between words


Synonymy  | Words spelled differently that have the same meaning
Antonymy  | Words having the opposite meanings to each other
Hyponymy  | Generic term and a specific instance of it
Hypernymy | Broad category that includes other words
Meronymy  | Constituent part or a member of something
Holonymy  | Semantic relation between a whole and its parts
Homonymy  | Words with identical forms but different meanings
Polysemy  | Words with two or more distinct meanings


Homonymy and polysemy are very similar, and the key difference is that a pol- 
ysemous word is one word with different meanings, while homonyms are different 
words that share a shape (usually both spelling and pronunciation). For example, 
most people would consider the noun tire (the wheels on your car) and the verb tire 
(what happens when you exercise) to be completely different words, even though 
they look and sound the same. They’re homonyms. On the other hand, most people 
agree that there is only one word offense, but that it has various meanings which are 
all related: the attacking team, a criminal act, a feeling of being insulted, etc. 


3.5.1 Named Entity Recognition 


Named entity recognition (NER) is a task in natural language processing that seeks 
to identify and label words or phrases in text that refer to a person, location, or- 
ganization, date, time, or quantity. It is a subtask of information extraction and is 
sometimes called entity extraction. Due to the reuse of words and ambiguity of nat- 
ural language, entity recognition is hard. Take, for instance, the word “Washington,”
which may be a reference to a city, a state, or a president. It would be difficult to dis- 
ambiguate this word without the context of real-world knowledge. Ambiguities can 
exist in two ways: different entities of the same type (George Washington and Wash- 
ington Carver are both persons) or entities of different types (George Washington or 
Washington state) (Table 3.6). 


Table 3.6: Named entities 


Person       | George Washington
Location     | Washington State
Organization | General Motors
Date         | Fourth of July
Time         | Half past noon
Quantity     | Four score


While regular expressions can be used to some extent for named entity recog-
nition [HN14], the standard approach is to treat it as a sequence labeling task, for
example with an HMM, in similar fashion to POS-tagging or chunking [AL13]. Conditional ran-
dom fields (CRFs) have shown some success in named entity recognition. How-
ever, training a CRF model typically requires a large corpus of annotated training
data [TKSDM03c]. Even with a lot of data, named entity recognition is still largely
unsolved. 
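
The extent to which regular expressions can recognize entities is easy to sketch. The patterns below are hypothetical and deliberately narrow: a person is a title followed by capitalized words, a date is a month name followed by a day and optional year, and a quantity is a number followed by one of a few units.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical, deliberately narrow patterns for three entity types.
PATTERNS = [
    ("Person", re.compile(r"\b(?:Mr|Mrs|Ms|Dr|President)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)*")),
    ("Date", re.compile(r"\b(?:" + MONTHS + r") \d{1,2}(?:, \d{4})?")),
    ("Quantity", re.compile(r"\b\d+(?:\.\d+)? ?(?:ml|kg|km|percent)\b")),
]


def find_entities(text):
    """Return (mention, label) pairs in order of appearance in the text."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(mention, label) for _, mention, label in sorted(found)]


text = "President George Bush met Dr. Graham on July 4, 1989 and poured 0.5ml."
print(find_entities(text))
```

Patterns like these rely entirely on surface cues such as titles and capitalization. A bare mention such as “Washington” carries no such cue, which is why sequence labeling models that use context are the standard approach.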


3.5.2 Relation Extraction 


Relationship extraction is the task of detecting semantic relationships of named en- 
tity mentions in text. For instance, from the following sentence, 


President George Bush and his wife Laura attended the Congressional Dinner. 


we can extract a set of relations between the entities: George Bush, Laura, Congres-
sional Dinner (Table 3.7). Note that the second relation (George Bush is married
to Laura) logically follows from the first (Laura is married to George Bush), even
though it may not be explicitly stated in the text.


Table 3.7: Entity relations


Laura married to George Bush        | Person – Person
George Bush married to Laura        | Person – Person
George Bush at Congressional Dinner | Person – Location
President George Bush               | Org – Person


The common approach to relation extraction is to divide it into subtasks:


1. Identify any relations between entities 
2. Classify the identified relations by type 
3. Derive logical/reciprocal relations. 


The first subtask is typically treated as a classification problem, where a binary 
decision is made as to whether a relation is present between any two entities within 
the text. The second subtask is a multiclass prediction problem. Naive Bayes and 
SVM models have been successfully applied to both subtasks [BB07, Hon05]. The
last subtask is a logical inference task. Relation extraction plays an important role 
in question answering tasks. 
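
The third subtask, deriving reciprocal relations, can be sketched as a simple inference rule over extracted triples. The triple format (subject, relation, object), the SYMMETRIC set, and the function name are assumptions for illustration.

```python
# Relations assumed to hold in both directions (hypothetical list).
SYMMETRIC = {"married to"}


def add_reciprocal(relations):
    """Return the input triples plus the reverse of every symmetric relation."""
    derived = set(relations)
    for subject, relation, obj in relations:
        if relation in SYMMETRIC:
            derived.add((obj, relation, subject))
    return derived


extracted = [
    ("Laura", "married to", "George Bush"),
    ("George Bush", "at", "Congressional Dinner"),
]
for triple in sorted(add_reciprocal(extracted)):
    print(triple)
```

Only relations listed as symmetric are reversed, so “George Bush at Congressional Dinner” does not produce a reciprocal triple.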




3.5.3 Event Extraction 


Events are mentions within text that have a specific location and instant or interval
in time associated with them. The task of event detection is to detect the mentions
of events in text and to identify the class to which they belong. Some examples of
events are: the Super Bowl, the Cherry Blossom Festival, and our 25th wedding an-
niversary celebration. Both rules-based and machine learning approaches for event
detection are similar to those for relationship extraction [Rit+12, MSM11]. Such
approaches have had mixed success due to the need for external context and the 
importance of temporal relations. 


3.5.4 Semantic Role Labeling 


Semantic role labeling (SRL), also known as thematic role labeling or shallow se- 
mantic parsing, is the process of assigning labels to words and phrases that indicate 
their semantic role in the sentence. A semantic role is an abstract linguistic construct 
that refers to the role that a subject or object takes on with respect to a verb. These 
roles include: agent, experiencer, theme, patient, instrument, recipient, source, ben- 
eficiary, manner, goal, or result. 

Semantic role labeling can provide valuable context [GJ02], whereas syntactic
parsing can only provide grammatical structure. The most common approach to
SRL is to parse a set of target sentences to identify predicates [PWM08]. For each of
these predicates, a machine learning classifier trained on a dataset such as PropBank
or FrameNet is used to predict a semantic role label. These labels serve as highly
useful features for further tasks such as text summarization or question answering
[JN08, BFL98b].


PropBank (the Proposition Bank) is a corpus of Penn Treebank sentences 
fully annotated with semantic roles, where each of the roles is specific to an 
individual verb sense. Each verb maps to a single instance in PropBank. The 
corpus was released in 2005. 


FrameNet is another corpus of sentences annotated with semantic roles. 
Whereas PropBank roles are specific to individual verbs, FrameNet roles 
are specific to semantic frames. A frame is the background or setting in 
which a semantic role takes place—it provides a rich set of contexts for the 
roles within the frame. FrameNet roles have much finer grain than those of 
PropBank. FrameNet contains over 1200 semantic frames, 13,000 lexical 
units, and 202,000 example sentences. 


3.6 Discourse Representations


Discourse analysis is the study of the structure, relations, and meaning in units of 
text that are longer than a single sentence. More specifically, it investigates the flow 
of information and meaning by a collection of sentences taken as a whole. Discourse 
presumes a sender, receiver, and message. It encompasses characteristics such as the 
document/dialogue structure, topics of discussion, cohesion, and coherence of the 
text. Two popular tasks in discourse analysis are coreference resolution and dis- 
course segmentation. 


3.6.1 Cohesion 


Cohesion is a measure of the structure and dependencies of sentences within dis- 
course. It is defined as the presence of information elsewhere in the text that supports 
presuppositions within the text. That is, cohesion provides continuity in word and 
sentence structure. It is sometimes called “surface level” text unity, since it provides 
the means to link structurally unrelated phrases and sentences together [BN00].
There are six types of cohesion within text: coreference, substitution, ellipsis, con-
junction, reiteration, and collocations. Of these, coreference is by far the most
common, as observed in the relation between “Jack” and “He” in the two sentences:


Jack ran up the hill. 
He walked back down. 


3.6.2 Coherence 


Coherence refers to the existence of semantic meaning to tie phrases and sentences 
together within text. It can be defined as continuity in meaning and context, and 
usually requires inference and real-world knowledge. Coherence is often based on 
conceptual relationships implicitly shared by both the sender and receiver that are 
used to construct a mental representation of the discourse [WG05]. An example of 
coherence can be seen in the following example which presumes knowledge that a 
bucket holds water: 


Jack carried the bucket. 
He spilled the water. 


3.6.3 Anaphora/Cataphora 


Anaphora refers to the relation between two words or phrases where the interpre- 
tation of one, called an anaphor, is determined by the interpretation of a word that 
came before, called an antecedent. Cataphora is where the interpretation of a word 
is determined by another word that came after in the text. Both are important char- 
acteristics of cohesion in discourse. 


Anaphora: The court cleared its docket before adjourning. 
Cataphora: Despite his carefulness, Jack spilled the water. 


3.6.4 Local and Global Coreference 


The linguistic process by which anaphors are linked with their antecedents is 
known as coreference resolution. It is a well-studied problem in discourse. When 
the linking occurs within a document, it is commonly termed local coreference; when 
it occurs across documents, it is termed global coreference. Essential for 
disambiguating pronouns and connecting them with the right individual mentions 
within text, coreference resolution also plays an important role in entity 
resolution [Lee+13, Sin+13]. 

Coreference resolution can be considered a classification task, and algorithms 
for resolving coreference range in accuracy from 70% for named entities to 90% for 
pronouns [PP09]. 


3.7 Language Models 


A statistical language model is a probability distribution over sequences of words. 
Given such a sequence, say of length L, it assigns a probability to the whole 
sequence. In other words, it tries to assign a probability to each possible sequence 
of words or tokens. Given a sequence of L tokens $w_1, w_2, \ldots, w_L$, a language 
model will predict the probability $P(W)$: 


$$P(W) = P(w_1 w_2 \ldots w_L) \tag{3.8}$$ 


How is this useful? A language model tries to predict how frequently a 
phrase occurs within the natural use of a language. Having a way to estimate the 
relative likelihood of different phrases is useful in many natural language processing 
applications, especially ones that generate text as an output. For instance, language 
models can be used for spell correction by predicting a word $w_L$ given all of the 
words before it: 


$$P(w_L \mid w_{L-1} w_{L-2} \ldots w_1) \tag{3.9}$$ 


Language modeling is used in speech recognition, machine translation, part-of- 
speech tagging, parsing, handwriting recognition, information retrieval, and other 
applications. 
3.7.1 N-Gram Model 


We can extend this to the general case of n-grams. We assume that the probability 
of observing the ith word $w_i$ given the full history of preceding words can be 
approximated by the probability of observing it given the shortened history of 
only the n preceding words (nth order Markov property). 

A unigram model used in information retrieval can be treated as the combination 
of several one-state finite automata. It treats each term as independent of its 
context, so the probability of a sequence factors into the probabilities of its terms: 


$$P(w_1 w_2 \ldots w_L) = \prod_{i=1}^{L} P(w_i) \tag{3.10}$$ 


The terms bigram and trigram denote n-gram language models with n = 2 
and n = 3, respectively. The conditional probabilities are estimated from n-gram 
frequency counts, and the probability of the sequence is their product: 


$$P(w_1 w_2 \ldots w_L) = \prod_{i=n}^{L} P(w_i \mid w_{i-1} w_{i-2} \ldots w_{i-n}) \tag{3.11}$$ 
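
As a minimal sketch of how such conditional probabilities can be estimated from frequency counts, the following Python snippet computes maximum-likelihood bigram probabilities, P(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}), on a toy corpus. The corpus, the function names, and the `<s>` / `</s>` sentence-boundary tokens are illustrative assumptions and are not taken from the text.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count bigrams and their left-hand contexts over tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]  # assumed boundary markers
        for prev, curr in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, curr)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, curr, context_counts, bigram_counts):
    """Maximum-likelihood estimate of P(curr | prev); 0.0 if prev was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, curr)] / context_counts[prev]

def sentence_prob(tokens, context_counts, bigram_counts):
    """Probability of a sentence as the product of its bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, curr in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, curr, context_counts, bigram_counts)
    return prob

corpus = [
    ["jack", "ran", "up", "the", "hill"],
    ["jack", "walked", "down", "the", "hill"],
]
context_counts, bigram_counts = train_bigram_counts(corpus)

# "jack" is followed by "ran" in one of its two occurrences.
print(bigram_prob("jack", "ran", context_counts, bigram_counts))
# An unseen bigram ("ran", "down") drives the whole product to zero.
print(sentence_prob(["jack", "ran", "down", "the", "hill"], context_counts, bigram_counts))
```

The second call illustrates why unsmoothed counts are fragile: one unseen bigram is enough to assign zero probability to an otherwise plausible sentence.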


3.7.2 Laplace Smoothing 


The sparsity of n-grams can become a problem, especially if the set of documents 
used to create the n-gram language model is small. In those cases, it is not 
uncommon for certain n-grams to have zero counts in the data. The language model would 
assign zero probability to these n-grams. This creates a problem when these n-grams 
occur in test data. Because of the Markov assumption, the probability of a sequence 
is equal to the product of the individual probabilities of the n-grams. A single zero 
probability n-gram would set the probability of the sequence to be zero. 

To overcome this problem, it is common to use a technique called smoothing. 
The simplest smoothing algorithm initializes the count of every possible n-gram at 
1. This is known as Laplace or add-one smoothing, and guarantees that there will al- 
ways be a small probability that any n-gram occurs. Unfortunately, as n-gram spar- 
sity grows, this approach becomes less useful as it dramatically shifts occurrence 
probabilities. 

If a word was never seen in the training data, then the probability of any sentence 
containing it is zero. Clearly this is undesirable, so we apply Laplace smoothing: 
we add 1 to every count so that it is never zero. To balance this, we add the 
number of possible words to the divisor, so that the probabilities still sum to 1. 

Laplace smoothing is a simple, inelegant approach that provides modest improvements 
for tasks like text classification. In general, we can use a pseudocount 
parameter $\alpha > 0$: 


$$\theta_i = \frac{x_i + \alpha}{N + \alpha d} \tag{3.12}$$ 


where $x_i$ is the observed count of the ith outcome, $N$ is the total count, and 
$d$ is the number of possible outcomes. Setting $\alpha = 1$ recovers add-one 
smoothing. 
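
The pseudocount formula, (x_i + α) / (N + α d), can be sketched in a few lines of Python. The function name, the toy counts, and the three-word vocabulary are assumptions made for illustration.

```python
from collections import Counter

def additive_smoothing(counts, vocabulary, alpha=1.0):
    """Return smoothed probabilities (x_i + alpha) / (N + alpha * d)."""
    total = sum(counts[w] for w in vocabulary)  # N: total observed count
    d = len(vocabulary)                         # d: number of possible outcomes
    return {w: (counts[w] + alpha) / (total + alpha * d) for w in vocabulary}

counts = Counter({"hill": 3, "bucket": 1})
vocabulary = ["hill", "bucket", "water"]  # "water" was never observed

probs = additive_smoothing(counts, vocabulary, alpha=1.0)
for word in vocabulary:
    print(word, probs[word])
```

With α = 1 the unseen word "water" receives probability 1/7 instead of zero, and the three probabilities still sum to 1. Smaller values of α move less probability mass away from the observed words.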
A more effective and widely used method is Kneser-Ney smoothing, which builds on 
absolute discounting: a fixed value $\delta$ is subtracted from each observed count, 
discounting n-grams with low frequencies, and the result is combined with a 
lower-order term: 


$$P_{abs}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1} w_i) - \delta, 0)}{\sum_{w'} c(w_{i-1} w')} + \alpha P_{abs}(w_i) \tag{3.13}$$ 


where $c(\cdot)$ denotes an n-gram count. 
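
The following is a minimal sketch of interpolated absolute discounting for bigrams, the building block that Kneser-Ney smoothing extends. It is not full Kneser-Ney, which additionally replaces the lower-order term with a continuation probability. The choice δ = 0.75, the interpolation weight α = δ · (number of distinct followers) / (total context count), and the toy counts are assumptions for illustration.

```python
def absolute_discount_prob(prev, curr, bigram_counts, unigram_probs, delta=0.75):
    """P(curr | prev) with a fixed discount delta and a unigram fallback term."""
    # Counts of every word observed after `prev`.
    followers = {w: c for (p, w), c in bigram_counts.items() if p == prev}
    total = sum(followers.values())
    if total == 0:
        # Unseen context: fall back to the lower-order distribution.
        return unigram_probs.get(curr, 0.0)
    discounted = max(followers.get(curr, 0) - delta, 0.0) / total
    # Weight chosen so the discounted mass is redistributed and the result sums to 1.
    alpha = delta * len(followers) / total
    return discounted + alpha * unigram_probs.get(curr, 0.0)

# Hypothetical counts and unigram probabilities.
bigram_counts = {("the", "hill"): 3, ("the", "bucket"): 1}
unigram_probs = {"hill": 0.5, "bucket": 0.25, "water": 0.25}

for word in ("hill", "bucket", "water"):
    print(word, absolute_discount_prob("the", word, bigram_counts, unigram_probs))
```

The unseen bigram ("the", "water") now receives a small non-zero probability taken from the mass discounted off the observed bigrams.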


3.7.3 Out-of-Vocabulary 


Another serious problem for language models arises when a word is not in the 
vocabulary of the model itself. In such a scenario, the n-grams that contain an 
out-of-vocabulary (OOV) word are ignored. The n-gram probabilities are smoothed 
over all the words in the vocabulary even if they were not observed. 

To explicitly model the probability of out-of-vocabulary words, we can intro- 
duce a special token (e.g., <unk>) into the vocabulary. Out-of-vocabulary words in 
the corpus are effectively replaced with this special <unk> token before n-gram 
counts are accumulated. With this option, it is possible to estimate the transition 
probabilities of n-grams involving out-of-vocabulary words. By doing so, however, 
we treat all OOV words as a single entity, ignoring the linguistic information. 

Another approach is to use approximate n-gram matching. OOV n-grams are 
mapped to the closest n-gram that exists in the vocabulary, where proximity is based 
on some semantic measure of closeness (we will describe word embeddings in more 
detail in a later chapter). 

A simpler way to deal with OOV n-grams is the practice of backoff, based on the 
concept of counting smaller n-grams with OOV terms. If no trigram is found, we 
instead count bigrams. If no bigram is found, we use unigrams. 
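
Two of the strategies above, replacing OOV words with a special token and backing off to shorter n-grams, can be sketched as follows. The helper names, the toy vocabulary, and the simplification of returning raw counts rather than discounted probabilities are assumptions for illustration.

```python
from collections import Counter

def replace_oov(tokens, vocabulary, unk="<unk>"):
    """Map every token outside the vocabulary to the special <unk> token."""
    return [t if t in vocabulary else unk for t in tokens]

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def backoff_count(ngram, count_tables):
    """Return (order, count) for the longest observed suffix of `ngram`."""
    for start in range(len(ngram)):
        suffix = tuple(ngram[start:])
        count = count_tables[len(suffix)][suffix]
        if count > 0:
            return len(suffix), count
    return 0, 0  # not even the last word was observed

vocabulary = {"jack", "spilled", "the", "water"}
train = replace_oov("jack spilled the water".split(), vocabulary)
count_tables = {n: ngram_counts(train, n) for n in (1, 2, 3)}

print(replace_oov("jill spilled the milk".split(), vocabulary))
print(backoff_count(("jack", "spilled", "the"), count_tables))   # trigram found
print(backoff_count(("carried", "the", "water"), count_tables))  # backs off to a bigram
```

A practical backoff model would also discount the higher-order estimates so that the backed-off probabilities remain normalized; this sketch only shows the lookup order.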


3.7.4 Perplexity 


In information theory, perplexity measures how well a probability distribution 
predicts a sample. Perplexity is a commonly used measure to evaluate the performance 
of a language model. It measures the intrinsic quality of an n-gram model as a 
function of the probability that the model assigns to a test sequence 
$W = w_1 w_2 \ldots w_N$, given by: 


$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} \tag{3.14}$$ 


$$= \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \tag{3.15}$$ 


If the model is based on bigrams, perplexity reduces to the expression: 


$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \tag{3.16}$$ 





Lower measures of perplexity imply that the model predicts the test data better, 
while higher perplexity values imply lower prediction quality. Note that it is 
important for the test sequence to be composed of the same n-grams as were used to 
train the language model, or else the perplexity will be very high. 
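
As a minimal sketch, perplexity can be computed from the per-token probabilities that a model assigns to a test sequence, that is, as the inverse probability of the sequence normalized by its length N. The function name and the toy probabilities are assumptions; the computation is done in log space to avoid numerical underflow on long sequences.

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence given the probability assigned to each token.

    Equivalent to (p_1 * p_2 * ... * p_N) ** (-1 / N), computed with logarithms.
    Raises ValueError if any probability is zero, which is one reason
    smoothing is needed before evaluating on unseen text.
    """
    n = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# A model that is always choosing uniformly among four options
# has a perplexity of (approximately) 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# More confident predictions give a lower perplexity.
print(perplexity([0.9, 0.8, 0.95, 0.85]))
```

For a bigram model, `token_probs` would hold the conditional probabilities P(w_i | w_{i-1}) of the test tokens.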


3.8 Text Classification 


Text classification is a core task in many applications such as information retrieval, 
spam detection, or sentiment analysis. The goal of text classification is to assign doc- 
uments to one or more categories. The most common approach to building classi- 
fiers is through supervised machine learning whereby classification rules are learned 
from examples [SM99, CT94, Seb02]. We provide a brief overview of the process 
by which these classifiers are created. Readers can refer to the previous chapter for 
the details of the machine learning algorithms. 


3.8.1 Machine Learning Approach 


Most problems in computational linguistics end up as text classification problems 
that can be addressed with a supervised machine learning approach. Text classifica- 
tion consists of document representation, feature selection, application of machine 
learning classifier, and finally the evaluation of classifier performance. Feature se- 
lection can leverage any of the morphological, lexical, syntactic, semantic, or dis- 
course representations introduced in the previous sections. 

Given a set $D_{labeled}$ of n documents, the first step is to construct representations 
of these documents in a feature space. The common method is to use a bag-of-words 
approach with n-gram frequency or TFIDF to create document vectors $\mathbf{x}_i$ and 
their labeled categories $y_i$: 


$$D_{labeled} = (\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n) \tag{3.17}$$ 


With this data, we can train a classification model to predict labels of un-annotated 
text samples. Popular machine learning algorithms for text classification include K- 
nearest neighbor, decision trees, naive Bayes, support vector machines, and logistic 
regression. The general text classification pipeline can be summarized as: 
Algorithm 1: Text classification pipeline 


Data: A set of documents D_labeled 
Result: A trained model h(x) 
begin 
    preprocess documents (e.g., tokenize) 
    create document representations x_i 
    split into train, validation, test sets 
    for x_i ∈ X do 
        train machine learning classifier model on train set 
        tune model on validation set 
    evaluate tuned model on test set 
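
The pipeline can be illustrated with a small, self-contained sketch: whitespace tokenization, bag-of-words counts, a multinomial naive Bayes classifier with add-one smoothing, and evaluation on held-out examples. The toy documents, labels, and function names are assumptions for illustration, and the validation and tuning step is omitted for brevity.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Preprocess a document into lowercase whitespace tokens."""
    return text.lower().split()

def train_naive_bayes(labeled_docs, alpha=1.0):
    """Collect per-class document and word counts (bag-of-words representation)."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return {"class_doc_counts": class_doc_counts, "word_counts": word_counts,
            "vocabulary": vocabulary, "alpha": alpha}

def predict(model, text):
    """Return the class with the highest smoothed log-posterior score."""
    total_docs = sum(model["class_doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, doc_count in model["class_doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # class prior
        for token in tokenize(text):
            if token not in model["vocabulary"]:
                continue  # ignore words never seen in training
            score += math.log((counts[token] + model["alpha"])
                              / (total_words + model["alpha"] * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train_set = [
    ("great movie loved it", "pos"),
    ("wonderful acting great story", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring story awful acting", "neg"),
]
test_set = [("loved the great acting", "pos"), ("awful boring movie", "neg")]

model = train_naive_bayes(train_set)
accuracy = sum(predict(model, x) == y for x, y in test_set) / len(test_set)
print("test accuracy:", accuracy)
```

In practice the document representation would more often be TFIDF vectors, and a validation set would be used to choose hyperparameters such as the smoothing constant before the final evaluation on the test set.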


3.8.2 Sentiment Analysis 


Sentiment analysis is a task that evaluates written or spoken language to determine 
if linguistic expressions are favorable, unfavorable, or neutral, and to what degree. It 
has widespread uses in business that include discerning customer feedback, gauging 
overall mood and opinion, or tracking human behavior. Sentiment encompasses both 
the affective aspects of text (how our emotions affect our communication) and the 
subjective aspects of text (the expression of our emotions, opinions, and beliefs). 
Textual sentiment analysis is the task of detecting the type and strength of one’s 
attitudes in sentences, phrases, or documents. 


3.8.2.1 Emotional State Model 


Models of emotion have been researched for several decades. An emotional state 
model is one that captures the human states of emotion. The Mehrabian and Russell 
model, for instance, decomposes human emotional states into three dimensions 
(Table 3.8). There are other emotional state models used in sentiment analysis, 
including Plutchik’s wheel of emotions and Russell’s two-dimensional emotion 
circumplex model (Fig. 3.7). 


Table 3.8: Mehrabian and Russell model of emotion 

Valence     Measures the pleasurableness of an emotion; also known as polarity. 
            Ambivalence is the conflict between positive and negative valence 
Arousal     Measures the intensity of emotion 
Dominance   Measures the dominion of an emotion over others 

Fig. 3.7: Plutchik’s wheel of emotions 


The simplest computational approach to sentiment analysis is to take the set of 
words that describe emotional states and vectorize them with the dimensional values 
of the emotional state model [Tab+11]. The occurrence of these words is computed 
within a document, and the sentiment of the document is equal to the aggregated 
scores of the words. This lexical approach is very fast, but suffers from the inability 
to effectively model subtlety, sarcasm, or metaphor [RR15]. Negation (e.g., “not 
nice” vs. “nice”) is also problematic with pure lexical approaches. 
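
A minimal sketch of the lexical approach is shown below: each word found in a valence lexicon contributes its score, and the document sentiment is the sum. The lexicon entries and their values are hypothetical placeholders chosen for illustration and are not taken from any published lexicon.

```python
# Hypothetical valence scores in [-1, 1]; not drawn from a real lexicon.
VALENCE_LEXICON = {"nice": 0.8, "great": 0.9, "awful": -0.9, "boring": -0.6}

def lexicon_sentiment(text, lexicon=VALENCE_LEXICON):
    """Sum the valence of every lexicon word in the text; unknown words score 0."""
    tokens = text.lower().split()
    return sum(lexicon.get(token, 0.0) for token in tokens)

print(lexicon_sentiment("nice"))
print(lexicon_sentiment("not nice"))          # same score: negation is ignored
print(lexicon_sentiment("great but boring"))  # mixed valence partially cancels
```

The first two calls return the same score, which makes the negation problem concrete: a pure bag-of-words lexicon has no way to let "not" flip the polarity of the word that follows it.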


The affective norms for English words (ANEW) dataset is a lexicon created 
by Bradley and Lang containing 1000 words scored for emotional ratings of 
valence, dominance, and arousal (Fig. 3.8). ANEW is very useful for longer 
texts and newswire documents. Another model is the SentiStrength model 
for short informal text developed by Thelwall et al., which has been applied 
successfully to analyze text and Twitter messages. 


3.8.2.2 Subjectivity and Objectivity Detection 


A closely related task in sentiment analysis is to grade the subjectivity or 
objectivity of a particular piece of text. The ability to separate subjective and objective 
parts, followed by sentiment analysis on each part, can be very useful for analysis. 
Objectivity detection could help identify personal bias, track hidden viewpoints, and 
alleviate the “fake news” problem existing today [WR05]. 


Fig. 3.8: ANEW emotion lexicon subset 

One approach to objectivity detection is to use n-grams or shallow parsing and 
pattern matching with a set of learned lexical-syntactic patterns. Another is to use 
lexical-syntactic features in combination with conversational discourse features to 
train a classifier for subjectivity. 


3.8.3 Entailment 


Textual entailment is the logical concept that truth in one text fragment leads to 
truth in another text fragment. It is a directional relation, analogous to the “if-then” 
clause. Mathematically, given text fragments X and Y, X entails Y when: 


$$P(Y \mid X) > P(Y) \tag{3.18}$$ 


where $P(Y \mid X)$ is considered the entailment confidence. Note that the relation X 
entails Y does not give any certainty that Y entails X (assuming so is a logical 
fallacy). 

Entailment is considered a text classification problem. It has widespread use in 
many NLP applications (e.g., question answering). Initial approaches toward 
entailment were logical-form based methods that required many axioms, inference rules, 
and a large knowledge base. These theorem-proving methods performed poorly in 
comparison to other statistical NLP approaches [HMM16]. 
Currently, the most popular entailment approaches are syntax based [AM10]. 
Parse trees are used to generate and compare similarity scores that are combined 
with an SVM or LR classifier to detect entailment. Such approaches are quite ca- 
pable of capturing shallow entailment but do poorly on more complex text (such as 
text that switches between active and passive voice). 

Recent semantic approaches have shown better ability to generalize by incorporating 
semantic role labeling in addition to lexical and syntactic features [Bur+07]. 
Even so, the gap between human-level entailment and computational approaches is 
still significant. Entailment remains an open research topic. 


3.9 Text Clustering 


While text classification is the usual go-to approach for text analytics, we are often 
presented with a large corpus of unlabeled data in which we seek to find texts that 
share common language and/or meaning. This is the task of text clustering [Ber03]. 

The most common approach to text clustering is via the k-means algorithm 
[AZ12]. Text documents are tokenized, sometimes stemmed or lemmatized, stop 
words are removed, and text is vectorized using bag-of-words or TFIDF. K-means 
is applied to the resulting document-term matrix for different k values. 


Algorithm 2: Text clustering pipeline 


Data: A set of documents D_unlabeled 
Result: k text clusters 
begin 
    preprocess documents (e.g., tokenize) 
    create document representations x_i 
    for values of k do 
        apply k-means algorithm 
    choose best k value 


There are two main considerations when using k-means. The first is the notion 
of distance between two text fragments. For k-means, this is the Euclidean distance, 
but other measures like cosine distance could theoretically be used. The second is 
determining the value of k—how many different clusters of text exist within a cor- 
pus. As in standard k-means, the elbow method is most widely used for determining 
the value of k. 
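
The clustering pipeline can be sketched in plain Python: bag-of-words vectors, k-means with Euclidean distance, and the within-cluster sum of squared distances (inertia) that the elbow method inspects for each k. The toy documents, the function names, and the naive initialization from the first k documents are assumptions for illustration; practical implementations use better initialization and several restarts.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Turn documents into term-count vectors over a shared sorted vocabulary."""
    vocab = sorted({t for d in docs for t in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[t]) for t in vocab])
    return vectors

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=10):
    """Return cluster assignments and inertia (sum of squared distances)."""
    centroids = [list(v) for v in vectors[:k]]  # naive deterministic initialization
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                       for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(euclidean(v, centroids[a]) ** 2
                  for v, a in zip(vectors, assignments))
    return assignments, inertia

docs = ["the cat sat", "the dog barked", "the cat purred", "the dog growled"]
vectors = bag_of_words(docs)

# Elbow method: look for the k after which inertia stops dropping sharply.
for k in (1, 2, 3):
    assignments, inertia = kmeans(vectors, k)
    print(k, assignments, round(inertia, 3))
```

On this toy corpus the inertia drops sharply from k = 1 to k = 2, where the "cat" and "dog" documents separate, which is the kind of bend the elbow method looks for.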
3.9.1 Lexical Chains 


Traditional approaches relying on bag-of-words ignore semantic relationships 
between words in a document and do not capture meaning. A method that can 
incorporate this semantic information is lexical chains. These chains originate from 
the linguistic concept of textual cohesion, where a sequence of related words is 
known to contain a semantic relation. For instance, the following words form a 
lexical chain: 


car → automobile → sedan → roadster 


Usually, a lexical database like WordNet is utilized both to predict lexical chains 
and to cluster the resulting concepts. Lexical chains are useful for higher-order tasks 
such as text summarization and discourse segmentation [MN02, Wei+15]. 


3.9.2 Topic Modeling 


Often, we have a collection of documents and want to broadly know what is dis- 
cussed within the collection. Topic modeling provides us the ability to organize, 
understand, and summarize large collections of text. A topic model is a statistical 
model used to discover abstract “topics” within a collection of documents. It is a 
form of text mining, seeking to identify recurring patterns of words in discourse. 


3.9.2.1 LSA 


Latent semantic analysis (LSA) is a technique that seeks to identify relationships 
between a set of documents and words based on the implicit belief that words close 
in meaning will occur in similar pieces of text. It is one of the oldest methods for 
topic modeling [Bir+08]. It uses a mathematical technique named singular value 
decomposition (SVD) to convert the document-term matrix of a text corpus into 
two lower-rank matrices: a document-topic matrix that maps topics to documents, 
and a topic-word matrix that maps words to topics. In doing so, LSA acts to reduce 
the dimensionality of the corpus vector space while identifying higher-order patterns 
within the corpus. To measure relatedness, LSA utilizes the cosine distance measure 
between two term vectors. 
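
As a sketch of this decomposition in standard notation (the symbols below are assumptions for illustration and do not appear in the text), let $X$ be the $m \times n$ document-term matrix for $m$ documents and $n$ terms. A rank-$k$ truncated SVD approximates it as

$$X \approx U_k \Sigma_k V_k^{\top}, \qquad U_k \in \mathbb{R}^{m \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{n \times k}$$

where $\Sigma_k$ is a diagonal matrix holding the $k$ largest singular values. The rows of $U_k \Sigma_k$ can be read as document-topic weights, and the rows of $V_k^{\top}$ as topic-word weights. The relatedness of two term vectors $t_a$ and $t_b$ in the reduced space is then commonly scored with the cosine measure:

$$\cos(t_a, t_b) = \frac{t_a \cdot t_b}{\lVert t_a \rVert \, \lVert t_b \rVert}$$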

LSA is very easy to train and tune, and the two matrices derived from LSA can 
be reused for other tasks as they contain semantic information. Unfortunately for 
large collections of documents, LSA can be quite slow. 
3.9.2.2 LDA 


Latent Dirichlet allocation (LDA) is a model that also acts to decompose a 
document-term matrix into a lower-order document-topic matrix and topic-word 
matrix. It differs from LSA in that it takes a stochastic, generative model approach 
and assumes topics to have a sparse Dirichlet prior. This is equivalent to the belief 
that only a small set of topics belong to any particular document and that topics 
mostly contain small sets of frequent words. As compared to LSA, LDA does better 
at disambiguation of words and identifies topics with finer details. 


3.10 Machine Translation 


Machine translation (abbreviated MT) refers to the process of translating text from 
a source language to a different target language. Translation is hard even for 
humans, who must fully capture meaning, tone, and style. Languages can 
have significantly different morphology, syntax, or semantic structure. For instance, 
it will be rare to find English words with more than 4 morphemes, but it is quite 
common in Turkish or Arabic. German sentences commonly follow the subject- 
verb-object syntactic structure, while Japanese mostly follows a subject-object-verb 
order, and Arabic prefers a verb-subject-object order. With machine translation, we 
typically focus on two measures: 


• Faithfulness = preserving the meaning of text in translation 
• Fluency = natural sounding text or speech to a native speaker. 


3.10.1 Dictionary Based 


In its simplest form, machine translation can be achieved by a direct translation of 
each word using a bilingual dictionary. A slight improvement may be to directly 
translate word phrases instead of individual words [KOM03]. Because of the lack 
of syntactic or semantic context, direct translation tends to do poorly in all but the 
simplest machine translation tasks [Dod02]. 

Another classical method for machine translation is based on learning lexical and 
syntactic transfer rules from the source to the target language. These rules provide 
a means to map the parse trees between languages, potentially altering the structure 
in the transformation. Due to the need for parsing, transfer methods are generally 
complex and difficult to manage, especially for large vocabularies. For this reason, 
classic machine translation systems usually take a combined approach, using direct 
translation for simple text structure and lexical/syntactic transfer for messier 
cases. 
3.10.2 Statistical Translation 


Statistical machine translation adopts a probabilistic approach to map from one lan- 
guage to another. Specifically, it builds two types of models by treating the problem 
as one similar to a Bayesian noisy channel problem in communications: 


• Language model (fluency) = P(X) 
• Translation model (faithfulness) = P(Y|X). 


The language model measures the probability that any sequence of words X is an 
actual sentence; that is, there is consistency within a language. The translation model 
measures the conditional probability that a sequence of words Y in the target 
language is a true translation of a sequence of words X in the source language. A 
statistical machine translation model will find the best translation to the target 
language $\hat{Y}$ by applying Bayes’ rule to $P(Y \mid X)$ and optimizing for: 


$$\hat{Y} = \arg\max_{Y} P(X \mid Y) P(Y) \tag{3.19}$$ 


Here $P(Y)$ is the language model evaluated on the candidate target sentence, and 
$P(X \mid Y)$ is the translation model applied in the reverse direction, as is usual 
in the noisy channel formulation. 
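
The optimization over candidate translations can be sketched as a simple argmax of the product P(X|Y) · P(Y). The candidate sentences and all probability values below are hypothetical numbers chosen for illustration; a real system would obtain them from trained translation and language models and would search a far larger candidate space.

```python
def best_translation(candidates, translation_model, language_model):
    """Pick the candidate Y that maximizes P(X | Y) * P(Y) for a fixed source X."""
    return max(candidates, key=lambda y: translation_model[y] * language_model[y])

# Hypothetical scores for one fixed source sentence X.
translation_model = {            # P(X | Y): faithfulness
    "he spilled the water": 0.30,
    "he the water spilled": 0.35,
    "he drank the water": 0.01,
}
language_model = {               # P(Y): fluency in the target language
    "he spilled the water": 0.020,
    "he the water spilled": 0.001,
    "he drank the water": 0.025,
}

candidates = list(translation_model)
print(best_translation(candidates, translation_model, language_model))
```

The word-for-word candidate has the highest translation score but is penalized by the language model for its unnatural word order, while the fluent but unfaithful candidate is penalized by the translation model. The product balances the two measures.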


Statistical models are based on the notion of word alignment, which is a mapping 
of a sequence of words from the source language to those of a target language. 
Because of differences between languages, this mapping will almost never be one- 
to-one. Furthermore, the order of words may be quite different. 


BLEU (bilingual evaluation understudy) is a common method to measure 
the quality of machine translation [Pap+02]. It measures the similarity 
between phrase-based model translations and human-created translations 
averaged over an entire corpus. Similar to precision, it is normally expressed as a 
value between 0 and 1 but sometimes scaled by a factor of 100. 
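
The core ingredient of BLEU, the clipped (modified) n-gram precision between a candidate translation and a reference, can be sketched as follows. This is only one component: the full metric also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The function names and example sentences are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with clipped counts."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

reference = "the cat is on the mat".split()

# Clipping stops a degenerate candidate from scoring perfectly.
print(modified_precision("the the the the the the the".split(), reference, 1))

# Three of the five candidate bigrams appear in the reference.
print(modified_precision("the cat sat on the mat".split(), reference, 2))
```

Without clipping, the first candidate would reach a unigram precision of 1.0 simply by repeating a common word.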


3.11 Question Answering 


Question answering (QA) is the NLP task of answering questions in natural lan- 
guage. It can leverage expert system, knowledge representation, and information 
retrieval methods. Traditionally, question answering is a multi-step process where 
relevant documents are retrieved, useful information is extracted from these docu- 
ments, possible answers are proposed and scored against evidence, and a short text 
answer in natural language is generated as a response. 


Question: Who won the Tournament of Champions on Jeopardy in 2011? 


Answer: IBM Watson debuted a system named DeepQA in 2011 that went on to win first 
place against the legendary champions on Jeopardy. 
Early question answering systems focused only on answering a predefined set of 
topics within a particular domain [KM11]. These were known as closed-domain QA 
systems, as opposed to open-domain QA systems that attempt to answer queries in 
any topic. Closed-domain systems often avoided the complexity of dialog process- 
ing and produced structured, pattern-based answers derived directly from expert 
systems. Modern open-domain QA systems provide much richer capability and in 
theory can leverage an unlimited set of knowledge sources (e.g., the internet) to 
answer questions through statistical processing. 

Question decomposition is the first step in any QA system, where a question is 
processed to form a query. In simple versions, questions would be parsed to find 
keywords which served as queries to an expert system to produce answers. This is 
known as query formation, where keywords are extracted from the question to for- 
mulate a relevant query. Sometimes, query expansion is used to identify additional 
query terms similar to those within a question [CR12]. In more advanced versions, 
syntactic processing (e.g., noun-phrases) and semantic processing (e.g., extracting 
entities) can be used to enrich extraction. Another method is query reformation, 
where the entities in the question are extracted along with their semantic relation. 
For instance, consider the following question and its semantic relation: 


Who invented the telegraph? → Invented(Person, telegraph) 


An answer module can pattern-match this relation against semantic databases and 
knowledge bases to retrieve a set of candidate answers. The candidates are scored 
against evidence and the one with the highest confidence is returned as a natural 
language response. Some questions are easier to answer. For instance, it is much 
easier to determine the date or year of an event (e.g., when was Superbowl XX) than 
it is to relate entities in particular contexts (e.g., which city is most like Toronto). 
The former would require only a small, targeted search while the latter search space 
is much larger. 


3.11.1 Information Retrieval Based 


Web-based question answering systems like Google Search are based on informa- 
tion retrieval (IR) methods that leverage the web. These text-based systems seek to 
answer questions by finding short texts from the internet or some other large collec- 
tion of documents. Typically, they map queries into a bag-of-words and use methods 
like LSA to retrieve a set of relevant documents and extract passages within them. 
Depending on the question type, answer strings can be generated with a pattern- 
extraction approach or n-gram tiling methods. IR-based QA systems are entirely 
statistical in nature and are unable to truly capture meaning beyond distributional 
similarity. 
3.11.2 Knowledge-Based QA 


Knowledge-based question answering systems, on the other hand, take a semantic 
approach. They apply semantic parsing to map questions into relational queries over 
a comprehensive database. This database can be a relational database or knowledge 
base of relational triples (e.g., subject-predicate-object) capturing real-world rela- 
tionships such as DBpedia or Freebase. Because of their ability to capture mean- 
ing, knowledge-based methods are more applicable for advanced, open-domain 
question-answering applications as they can bring in external information in the 
form of knowledge bases [Fu+12]. At the same time, they are constrained by the 
set of relations of those knowledge bases (Fig. 3.9). 


DBpedia is a free semantic relation database with 4.6 million entities ex- 
tracted from Wikipedia pages in multiple languages. It contains over 3 billion 
relational triples expressed in the resource description framework (RDF) for- 
mat. DBpedia is often considered the foundation for the semantic web, also 
known as the linked open data cloud. First released in 2007, DBpedia contin- 
ues to evolve through crowdsourced updates in similar fashion to Wikipedia. 


Fig. 3.9: Open-domain QA system 


3.11.3 Automated Reasoning 


Recent QA systems have begun to incorporate automated reasoning (AR) to extend 
beyond the semantic relations of knowledge-based systems. Automated reasoning 
is a field in artificial intelligence that explores methods in abductive, probabilistic, 
spatial and temporal reasoning by computer systems. By creating a set of first-order 
logic clauses, QA systems can enhance a set of semantic relations and evidence 
retrieved in support of answer hypotheses [FGP10]. Prolog is a common declarative 
language approach used to maintain this set of clauses. 

IBM Watson’s DeepQA is an example of a question answering system that 
incorporates a variety of IR-based, knowledge-based, and automated reasoning 
methods. By leveraging reportedly 100 different approaches and knowledge base sources 
to generate candidate answers that are evidence-scored and merged [Wan+12], 
DeepQA was able to exceed human-level performance in the game of Jeopardy in 
2011. IBM has since deployed DeepQA into a variety of other domains with varying 
success. 


A common metric used to measure question answering system performance 
is Mean reciprocal rank (MRR). It is based on using a gold set of questions 
that have been manually labeled by humans with correct answers. To evaluate 
a QA system, the set of ranked answers of the system would be compared 
with the gold set labels of a corpus of N questions, and the MRR is given by: 


$$\mathrm{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i} \tag{3.20}$$ 


Current state-of-the-art QA systems exceed MRR = 0.83 on the commonly 
used TREC-QA benchmark. 
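
Mean reciprocal rank averages, over N questions, the reciprocal of the rank at which the first correct answer appears. A minimal sketch is given below; the function name and toy answer lists are assumptions, and a question whose gold answer is missing from the ranked list is assumed to contribute 0, which is the usual convention.

```python
def mean_reciprocal_rank(ranked_answers, gold_answers):
    """Average of 1 / rank of the gold answer over all questions."""
    total = 0.0
    for answers, gold in zip(ranked_answers, gold_answers):
        for rank, answer in enumerate(answers, start=1):
            if answer == gold:
                total += 1.0 / rank
                break  # only the first correct answer counts
    return total / len(gold_answers)

ranked_answers = [
    ["a", "b", "c"],   # gold answer at rank 1
    ["x", "y", "z"],   # gold answer at rank 2
    ["p", "q", "r"],   # gold answer not returned
]
gold_answers = ["a", "y", "s"]

print(mean_reciprocal_rank(ranked_answers, gold_answers))  # (1 + 1/2 + 0) / 3
```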


3.12 Automatic Summarization 


Automatic summarization is a useful NLP task that identifies the most relevant in- 
formation in a document or group of documents and creates a summary of the con- 
tent. It can be an extraction task that takes the most relevant phrases or sentences 
in original form and uses them to generate the summary, or an abstraction task that 
generates natural language summaries from the semantic content [AHG99, BN00]. 
Both approaches mirror how humans tend to summarize text, though the former 
extracts text while the latter paraphrases text. 


3.12.1 Extraction Based 


Extraction-based summarization is a content selection approach to distilling docu- 
ments. In most implementations, it simply extracts a subset of sentences deemed 
most important. One method to measure importance is to count informative words 
based on lexical measures (e.g., TFIDF). Another is to use discourse measures (e.g., 
coherence) to identify key sentences. Centroid-based methods evaluate word prob- 
ability relative to the background corpus to determine importance. A creative ap- 
proach called TextRank takes a graph-based approach to assign sentence scores 
based on lexical similarity of words. As long as plagiarism is not a concern, 
extraction-based summarization is the more popular approach. 
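
A minimal sketch of extraction-based content selection is shown below. It scores each sentence by the average corpus frequency of its non-stop words and returns the top-scoring sentences in their original order. Raw word frequency is used here as a simpler stand-in for TFIDF, and the stop-word list, function name, and example sentences are assumptions for illustration.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}  # tiny illustrative list

def extractive_summary(sentences, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    tokenized = [
        [t for t in s.lower().replace(".", "").split() if t not in STOP_WORDS]
        for s in sentences
    ]
    # Word importance: how often the word occurs across the whole document.
    freq = Counter(t for tokens in tokenized for t in tokens)
    # Sentence score: average importance of its words (avoids favoring long sentences).
    scores = [sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0
              for tokens in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(top[:num_sentences])]

sentences = [
    "Language models assign probabilities to word sequences.",
    "Smoothing helps language models handle unseen word sequences.",
    "The weather was pleasant.",
]
print(extractive_summary(sentences, num_sentences=1))
```

The off-topic third sentence shares no frequent words with the rest of the document and is therefore ranked last.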


3.12.2 Abstraction Based 


Unlike extraction-based copying, abstraction-based approaches take a semantic ap- 
proach. One method is to use entity recognition and semantic role labeling to iden- 
tify relations. These can be fed into standard templates (e.g., mad-lib approach) or a 
natural language generation engine to create synopses. The use of lexical chains can 
aid in the identification of central themes, where the strongest chain is indicative of 
the main topic [SM00]. 

Automatic summarization remains a difficult task. State-of-the-art methods are 
around the 35% precision level, with performance differing greatly by underlying 
document type [GG17]. Deep learning methods hold significant promise, as we will 
see in a later chapter. 


3.13 Automated Speech Recognition 


Automatic speech recognition (ASR) is the NLP task of real-time computational 
transcription of spoken language. ASR has been at the forefront in the study of 
human-computer interfaces since the 1950s. With the advent of personal AI 
assistants like Siri, Alexa, or Cortana, the importance of ASR has skyrocketed in recent 
years. The ultimate goal of ASR is human-level (near 100%) speech transcription. 
Current ASR in perfect conditions can only approach 95% [Bak+09]. Evolution 
has given us the ability to recognize speech in a variety of conditions (e.g., noise, 
accents, diction, and tone) that computers cannot yet deal with, and much room for 
improvement in ASR remains. In the next sections, some background on the compu- 
tational representation of speech and the classical approaches to ASR are provided. 


3.13.1 Acoustic Model 


An acoustic model is a representation of the sounds in an audio signal used in auto- 
matic speech recognition. Its main purpose is to map acoustic waves to the statistical 
properties of phonemes, which are elementary linguistic units of sound that distin- 
guish one word from another in a language. Consider an audio signal as a sequence 
of short, consecutive time frames $S = s_1, s_2, \ldots, s_T$. Let a sequence of $M$ phonemes 
be represented by $F = f_1, f_2, \ldots, f_M$ and a sequence of $N$ words be represented by 
$W = w_1, w_2, \ldots, w_N$. In speech recognition, the goal is to predict the sequence of words $W$ 
from the audio input $S$: 


$$\hat{W} = \arg\max_{W} P(W \mid S) \tag{3.21}$$ 


$$\hat{W} = \arg\max_{W} P(S \mid F)\, P(F \mid W)\, P(W) \tag{3.22}$$ 


Here, $P(W)$ represents the probability that a string of words is an English sentence; 
that is, $P(W)$ is the language model. The quantity $P(S \mid F)$, which relates the audio frames to 
phonemes, is known as the acoustic model, and the quantity $P(F \mid W)$, which relates phonemes to 
words, is the pronunciation model. 
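The step from Eq. 3.21 to Eq. 3.22 is not spelled out in the text. One common way to obtain it, sketched here under the assumptions that the audio depends on the words only through the phonemes and that a single phoneme sequence $F$ dominates the sum over pronunciations, is:

```latex
\begin{align*}
\hat{W} &= \arg\max_{W} P(W \mid S) \\
        &= \arg\max_{W} \frac{P(S \mid W)\, P(W)}{P(S)}
           && \text{(Bayes' rule)} \\
        &= \arg\max_{W} P(S \mid W)\, P(W)
           && \text{($P(S)$ does not depend on $W$)} \\
        &= \arg\max_{W} \sum_{F} P(S \mid F)\, P(F \mid W)\, P(W)
           && \text{(assuming $P(S \mid F, W) = P(S \mid F)$)} \\
        &\approx \arg\max_{W} P(S \mid F)\, P(F \mid W)\, P(W)
           && \text{(keeping the dominant $F$)}
\end{align*}
```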


3.13.1.1 Spectrograms 


A spectrogram is a visual representation of the frequencies of an acoustic signal over 
a period of time, where the horizontal axis is time, the vertical axis is frequency, and 
the intensity of the audio signal is represented by the color at each point. A spec- 
trogram is generated using a sliding time window in which a short-time Fourier 
transform is performed. As a time-frequency visualization of a speech signal, spec- 
trograms are useful for both speech representations and for evaluation of text to 
speech systems (Fig. 3.10). 
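The sliding-window computation described above can be sketched as follows. This is an illustrative, unoptimized version that assumes a Hann window and a naive discrete Fourier transform in place of an FFT; `stft_magnitudes`, `frame_len`, and `hop` are hypothetical names chosen for this example.

```python
import cmath
import math

def stft_magnitudes(signal, frame_len=8, hop=4):
    """Magnitude spectrogram: one list of frequency-bin magnitudes per time frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # A Hann window tapers the frame edges to reduce spectral leakage.
        windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, x in enumerate(frame)]
        # Naive DFT, keeping only the non-negative frequency bins.
        bins = []
        for k in range(frame_len // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(windowed))
            bins.append(abs(coeff))
        frames.append(bins)
    return frames

# A pure tone with 2 cycles per 8-sample frame should peak in frequency bin 2.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
spec = stft_magnitudes(tone)
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
print(peak_bins)
```

Each inner list is one column of the spectrogram (time on the horizontal axis, frequency bin on the vertical axis); plotting the magnitudes as colors gives the picture described in the text.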





Fig. 3.10: Spectrogram 


3.13.1.2 MFCC 


Mel-frequency cepstral coefficients (MFCCs) are another useful representation of 
speech signals. MFCCs transform continuous audio signals into feature vectors, 
each representing a small window in time. Consider the cepstrum, which is the inverse 
fast-Fourier transform of the log of the fast-Fourier transform of an audio 
signal (Fig. 3.11): 

$$C = \left| \mathcal{F}^{-1}\left( \log \mathcal{F}(f(t)) \right) \right|^{2} \tag{3.23}$$ 

MFCCs are similar to the cepstrum and are given by taking the discrete cosine transform 
of the log of the fast-Fourier transform of an audio signal to which a triangular 
Mel-frequency filter bank has been applied. Mel filters are spaced linearly for 
frequencies below 1000 Hz and on a log scale for frequencies above 1000 Hz, 
closely corresponding to the response of the human ear: 


C = DCT (log (MEL(F (f(t))))) (3.24) 


MFCCs contain both time and frequency information about the audio signal. They 
are particularly useful for ASR because cepstral features are effectively orthogonal 
to each other and robust to noise. 
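The last two stages of Eq. 3.24 (log, then DCT) and the Mel spacing of the filters can be sketched as below. This is a simplified illustration, not a full MFCC pipeline: it assumes the widely used approximation $m = 2595 \log_{10}(1 + f/700)$ for the Mel scale, which the text does not specify, and the filterbank energies are made-up values. All function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    # Common Mel-scale approximation (an assumption; not given in the text).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def dct_ii(values):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
            for k in range(n)]

def mfcc_from_filterbank(energies, n_coeffs):
    """DCT of the log Mel filterbank energies of one frame (cf. Eq. 3.24)."""
    log_energies = [math.log(e) for e in energies]
    return dct_ii(log_energies)[:n_coeffs]

# Filter centre frequencies equally spaced on the Mel scale between 0 and 4000 Hz.
n_filters = 6
top_mel = hz_to_mel(4000.0)
centres_hz = [mel_to_hz(top_mel * (i + 1) / (n_filters + 1)) for i in range(n_filters)]

# Hypothetical filterbank energies for a single frame.
coeffs = mfcc_from_filterbank([1.0, 2.0, 4.0, 2.0, 1.0, 0.5], n_coeffs=3)
print([round(c) for c in centres_hz], coeffs)
```

Equal steps on the Mel scale translate into progressively wider steps in Hz, which is the "linear at low frequencies, logarithmic at high frequencies" spacing described above.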







Fig. 3.11: Cepstrum 


3.14 Case Study 


To provide further insight on applications of natural language processing, we present 
the following case study to guide readers through an application of text clustering, 
topic modeling, and text classification principles. The case study is based on the 
Reuters-21578 dataset, a collection of 21578 newswire stories from 1987. We begin 
by cleaning the dataset and transforming it into a format that permits easier analysis. 
Through exploratory data analysis, we will examine corpus structure and identify if 
text clusters exist and to what degree. We will model topics within the corpus, and 
compare our findings with the annotations provided in the dataset. Finally, we will 
explore various methods to classify the documents by topic. Hopefully, this case 
study will reinforce the fundamental principles of text analytics as well as identify 
key gaps in classical NLP. 


3.14.1 Software Tools and Libraries 


For this case study, we will use Python and the following libraries: 


• Pandas (https://pandas.pydata.org/) is a popular open source library for 
data structures and data analysis. We will use it for data exploration and some 
basic processing. 

• scikit-learn (http://scikit-learn.org/) is a popular open source library of machine 
learning algorithms and evaluation tools. We will use it only for sampling, creating 
datasets, and machine learning implementations of linear and non-linear 
algorithms in our case study. 

• NLTK (https://www.nltk.org/) is a suite of text and natural language processing 
tools. We will use it to tokenize, stem, and filter text before it is converted into vectors. 

• Matplotlib (https://matplotlib.org/) is a popular open source library for visualization. 
We will use it to visualize performance. 


3.14.2 EDA 


Our first task is to take a close look at the dataset by loading it and performing 
exploratory data analysis. To do so, we must extract metadata and the text body 
from each document in the corpus. If we take a close look at the corpus, we find 
(Figs. 3.12, 3.13 and 3.14): 


1. There are 11,367 documents that have one or more topic annotations. 
2. The greatest number of topics in a single document is 16. 

3. There are a total of 120 distinct topic labels in the corpus. 

4. There are 147 distinct place and 32 organization labels. 


So far so good. But before we perform any NLP analysis, we will want to perform 
some cursory text normalization: 
1. Transform to lower case 
2. Remove punctuation and numbers 
3. Stem verbs 
4. Remove stopwords 

Fig. 3.12: Document count by organization 

Fig. 3.13: Document count by non-US location 


To do so, we define a SimpleTokenizer method that will be useful when creating 
document representations. 


import re 
import nltk 
from nltk import word_tokenize 
from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer 
from sklearn.preprocessing import MultiLabelBinarizer 
from sklearn.feature_extraction.text import TfidfVectorizer 

nltk.download("punkt") 
nltk.download("stopwords", "data") 
nltk.data.path.append('data') 

stopWords = stopwords.words('english') 
charfilter = re.compile('[a-zA-Z]+') 

def SimpleTokenizer(text): 
    # Lower-case, drop stopwords, stem, and keep only alphabetic tokens 
    words = map(lambda word: word.lower(), word_tokenize(text)) 
    words = [word for word in words if word not in stopWords] 
    tokens = list(map(lambda token: PorterStemmer().stem(token), words)) 
    ntokens = list(filter(lambda token: charfilter.match(token), tokens)) 
    return ntokens 

vec = TfidfVectorizer(tokenizer=SimpleTokenizer, 
                      max_features=1000, 
                      norm='l2') 

# data_set is the DataFrame of documents loaded during EDA. 
# Keep only documents annotated with at least one of the five topics. 
mytopics = [u'cocoa', u'trade', u'money-supply', u'coffee', u'gold'] 
data_set = data_set[data_set[u'topics'].map(set(mytopics).intersection) 
                    .apply(lambda x: len(x) > 0)] 
docs = list(data_set[u'body'].values) 

# Binarize the topic labels after filtering so that the rows of 
# data_target line up with the rows of the document-term matrix. 
labelBinarizer = MultiLabelBinarizer() 
data_target = labelBinarizer.fit_transform(data_set[u'topics']) 

dtm = vec.fit_transform(docs) 

Fig. 3.14: Document count by topic 

3.14.3 Text Clustering 


We want to see if clusters exist in the documents, so let’s create some document 
representations through TFIDF. This gives us a document-term matrix, but typically 
the dimensions of this matrix are too large and the representations are sparse. Let’s 
first apply principal component analysis (PCA) to reduce the dimensionality. The 
original TFIDF vectors have dimension = 1000. Let’s take a look at the effect of 
dimensionality reduction by plotting the proportion of explained variance of the 
data as a function of the number of principal components (Fig. 3.15): 


import matplotlib.pyplot as plt 
from sklearn.decomposition import PCA 

# Explained variance as a function of the number of principal components 
explained_var = [] 
for components in range(1, 100, 5): 
    pca = PCA(n_components=components) 
    pca.fit(dtm.toarray()) 
    explained_var.append(pca.explained_variance_ratio_.sum()) 

plt.plot(range(1, 100, 5), explained_var, "ro") 
plt.xlabel("Number of Components") 
plt.ylabel("Proportion of Explained Variance") 




Fig. 3.15: Explained variance by number of PCA components 


The graph above shows that half of the variance can be explained by 60 components. 
Let’s apply this to the dataset, and visualize the results by plotting the first two PCA 
components of each document (Fig. 3.16). 


import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.decomposition import PCA 

components = 60 

palette = np.array(sns.color_palette("hls", 120)) 

pca = PCA(n_components=components) 
pca.fit(dtm.toarray()) 
pca_dtm = pca.transform(dtm.toarray()) 

# Color each document by the index of its first annotated topic 
plt.scatter(pca_dtm[:, 0], pca_dtm[:, 1], 
            c=palette[data_target.argmax(axis=1).astype(int)]) 

explained_variance = pca.explained_variance_ratio_.sum() 
print("Explained variance of the PCA step: {}%".format( 
    int(explained_variance * 100))) 

Fig. 3.16: PCA document projection 


We know that there are 5 distinct topics (though some documents might have over- 
lap), so let’s run the k-means algorithm with k = 5 to examine document grouping 
(Fig. 3.17). 


from sklearn.cluster import KMeans 

palette = np.array(sns.color_palette("hls", 5)) 

model = KMeans(n_clusters=5, max_iter=100) 
clustered = model.fit(pca_dtm) 
centroids = model.cluster_centers_ 
y = model.predict(pca_dtm) 

# Plot the first two PCA components, colored by assigned cluster 
ax = plt.subplot() 
sc = ax.scatter(pca_dtm[:, 0], pca_dtm[:, 1], 
                c=palette[y.astype(int)]) 


How does this compare with the manually annotated labels? (Fig. 3.18) 


palette = np.array(sns.color_palette("hls", 5)) 

# Map each document to the index (in mytopics) of one of its annotated topics 
gold_labels = (data_set['topics'].map(set(mytopics).intersection) 
               .apply(lambda x: x.pop()) 
               .apply(lambda x: mytopics.index(x))) 

ax = plt.subplot() 
sc = ax.scatter(pca_dtm[:, 0], pca_dtm[:, 1], c=palette[gold_labels]) 

Fig. 3.18: Manually labeled clusters 
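The visual comparison between the k-means clusters and the annotated topics can be complemented with a number. One simple option, which is not used in the case study itself and is offered here only as a sketch, is cluster purity: the fraction of documents that share the majority gold label of their cluster. The function name and toy data are hypothetical.

```python
from collections import Counter

def purity(cluster_ids, gold_ids):
    """Fraction of points that carry the majority gold label of their cluster."""
    by_cluster = {}
    for c, g in zip(cluster_ids, gold_ids):
        by_cluster.setdefault(c, []).append(g)
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in by_cluster.values())
    return majority_total / len(cluster_ids)

# Toy cluster assignments and gold labels for eight documents.
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 2]
toy_gold = [0, 0, 1, 1, 1, 2, 2, 0]
toy_purity = purity(toy_clusters, toy_gold)
print(toy_purity)
```

With the case study variables, a call such as `purity(list(y), list(gold_labels))` would apply the same measure to the k-means output; cluster ids need not match topic ids, since only the majority label within each cluster matters.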
3.14.4 Topic Modeling 


In addition to the lexical clustering of documents, let’s see if we can discern any 
natural topic structure within the corpus. We apply the LSA and LDA algorithms, 
which will associate words to a set of topics, and topics to our set of documents. 


3.14.4.1 LSA 


We start with the LSA algorithm and set the number of dimensions to 60 (Fig. 3.19): 


from sklearn.decomposition import TruncatedSVD 
import seaborn as sns 

components = 60 

palette = np.array(sns.color_palette("hls", 120)) 

# TruncatedSVD works directly on the sparse document-term matrix 
lsa = TruncatedSVD(n_components=components) 
lsa.fit(dtm) 
lsa_dtm = lsa.transform(dtm) 

plt.scatter(lsa_dtm[:, 0], lsa_dtm[:, 1], 
            c=palette[data_target.argmax(axis=1).astype(int)]) 

explained_variance = lsa.explained_variance_ratio_.sum() 
print("Explained variance of the SVD step: {}%".format( 
    int(explained_variance * 100))) 


As with PCA, let’s apply k-means with k = 5 clusters (Fig. 3.20). 


from sklearn.cluster import KMeans 

palette = np.array(sns.color_palette("hls", 8)) 

model = KMeans(n_clusters=5, max_iter=100) 
clustered = model.fit(lsa_dtm) 
centroids = model.cluster_centers_ 
y = model.predict(lsa_dtm) 

ax = plt.subplot() 
sc = ax.scatter(lsa_dtm[:, 0], lsa_dtm[:, 1], c=palette[y.astype(int)]) 

Fig. 3.20: k-means on LSA 


Let’s examine the documents of one of these clusters: 


232 
239 
249 
290 
402 
42 

562 
754 
842 


Talks on the possibility of reintroducing... 
Indonesia’s agriculture sector will grow... 
The International Coffee Organization... 
Talks on coffee export quotas at the... 
Coffee quota talks at the International... 
International Coffee Organization, ICO,... 
Talks at the extended special meeting of... 
International Coffee Organization (ICO)... 
Efforts to break an impasse between... 
A special meeting of the International Coffee... 
3.14.4.2 LDA 


Let’s see if the LDA algorithm can do better as a Bayesian approach to document 
clustering and topic modeling. We set the number of topics to the known number of 
topics = 5. 


import numpy as np 
import seaborn as sns 
from sklearn.cluster import KMeans 
from sklearn.decomposition import LatentDirichletAllocation 

components = 5 
n_top_words = 10 

palette = np.array(sns.color_palette("hls", 120)) 

def print_top_words(model, feature_names, n_top_words): 
    # For each topic, print the n_top_words terms with the largest weights 
    for topic_idx, topic in enumerate(model.components_): 
        message = "Topic #%d: " % topic_idx 
        message += " ".join([feature_names[i] 
                             for i in topic.argsort()[:-n_top_words - 1:-1]]) 
        print(message) 
    print() 

lda = LatentDirichletAllocation(n_components=components, 
                                max_iter=5, learning_method='online') 
lda.fit(dtm) 
lda_dtm = lda.transform(dtm) 

# scikit-learn >= 1.0; older releases used vec.get_feature_names() 
vec_feature_names = vec.get_feature_names_out() 
print_top_words(lda, vec_feature_names, n_top_words) 


Topic 0 said trade u.s. deleg quota brazil export year coffe market 
Topic 1 gold mine ounc ton said ltd compani ore feet miner 
Topic 2 fed volcker reserv treasuri bank borrow pct rate growth dlr 
Topic 3 said trade u.s. export japan coffe would ec market offici 
Topic 4 billion dlr mln pct januari februari rose bank fell year 


The LDA results are encouraging, and we can easily discern 4 of the 5 original 
topics from the list of words associated with each topic. 


3.14.5 Text Classification 


Now let’s see if we can build classifiers to possibly identify the topics above. We 
first randomize and split our dataset into train and test sets. 


from sklearn.model_selection import train_test_split 

data_set['label'] = gold_labels 

# Hold out 20% of the documents as the test set 
X_train, X_test, y_train, y_test = train_test_split( 
    data_set, gold_labels, test_size=0.2, random_state=10) 
print("Train Set = ", len(X_train)) 
print("Test Set = ", len(X_test)) 

# Keep only the text body as model input 
X_train = X_train[u'body'] 
X_test = X_test[u'body'] 


We then create a pipeline that builds classifiers based on 5 models: naive Bayes, 
logistic regression, SVM, K-nearest neighbor, and random forest. 


from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import LinearSVC 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier 

models = [('multinomial_nb', MultinomialNB()), 
          ('log_reg', LogisticRegression()), 
          ('linear_svc', LinearSVC()), 
          ('knn', KNeighborsClassifier(n_neighbors=6)), 
          ('rf', RandomForestClassifier(n_estimators=6))] 


We then train each model on the training set and evaluate on the test set. For each 
model, we want to see the precision, recall, F1 score, and support (number of samples) 
for each topic class. 


from sklearn.pipeline import Pipeline 
from sklearn.metrics import classification_report 

for m_name, model in models: 
    # Each pipeline vectorizes the raw text, then fits the classifier 
    pipeline = Pipeline([('vec', TfidfVectorizer(tokenizer=SimpleTokenizer)), 
                         (m_name, model)]) 
    pipeline.fit(X_train, y_train) 
    test_y = pipeline.predict(X_test) 
    print(classification_report(y_test, test_y, digits=6)) 
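To make the reported quantities concrete, the per-class metrics can be computed by hand from true positives, false positives, and false negatives. The sketch below uses the standard definitions; the function name and the toy labels are hypothetical and are not taken from the case study.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision, recall, F1, and support for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Toy gold labels and predictions for six documents.
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 0]
p, r, f1, support = precision_recall_f1(y_true_toy, y_pred_toy, label=0)
print(p, r, f1, support)
```

For class 0 there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3 with a support of 3.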


The results indicate that the linear SVM model performs best, 
with random forest a close second. This comparison is a bit misleading, since we didn’t tune 
any of these models. Hyperparameter tuning can significantly 
affect how well a classifier performs. Let’s try tuning the LinearSVC model 
using grid search with cross-validation. Cross-validation 
is important because we do not want to tune with our test set, which we will 
use only at the end to assess performance. Note also that this can take a while! 


from sklearn.model_selection import GridSearchCV 

# vectorizer and model are not defined in the listing; they are assumed to be 
# the TFIDF vectorizer and LinearSVC used above 
vectorizer = TfidfVectorizer(tokenizer=SimpleTokenizer) 
model = LinearSVC() 
pipeline = Pipeline([('vec', vectorizer), 
                     ('model', model)]) 

# Keys follow the <step name>__<parameter> convention; the candidate values 
# for max_features and C are a best reading of a partly illegible listing 
parameters = {'vec__ngram_range': ((1, 1), (1, 2)), 
              'vec__max_features': (500, 1000), 
              'model__loss': ('hinge', 'squared_hinge'), 
              'model__C': (1, 0.9)} 

grid_search = GridSearchCV(pipeline, parameters, verbose=1) 
grid_search.fit(X_train, y_train) 

test_y = grid_search.best_estimator_.predict(X_test) 
print(classification_report(y_test, test_y, digits=6)) 
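What grid search with cross-validation does behind the scenes can be sketched without any library: enumerate every combination of candidate values, and for each one rotate through k train/validation splits of the training data. The helper names and the grid below are hypothetical; this illustrates the idea and is not scikit-learn's implementation.

```python
from itertools import product

def param_grid(grid):
    """All combinations of the candidate values, one dict per combination."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

def k_fold_indices(n_samples, k):
    """Contiguous k-fold split: a list of (train_indices, validation_indices)."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

# Hypothetical grid: 2 x 2 = 4 candidate settings.
grid = {"C": (0.1, 1.0), "loss": ("hinge", "squared_hinge")}
candidates = param_grid(grid)
folds = k_fold_indices(10, 3)
print(len(candidates), [len(val) for _, val in folds])
```

Every candidate is trained and scored once per fold, so the cost grows as (number of combinations) times k, which is why the search can take a while. The test set never appears in any fold.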


As you can see, the SVM model typically outperforms other machine learning algorithms 
and often provides state-of-the-art quality (Fig. 3.21). Unfortunately, SVMs 
suffer from several major drawbacks, including the inability to scale to large 
datasets. As we will learn in later chapters, neural networks can bypass the limitations 
of SVMs. 


                      Test Set    Test Set    Test Set 
                      Precision   Recall      F1 
Naive Bayes           0.8262      0.7361      0.7048 
Logistic Regression   0.8929      0.8704      0.8606 
Linear SVM            0.9567      0.9537      0.9541 
K Nearest Neighbors   0.5802      0.3981      0.3959 
Random Forest         0.8854      0.8843      0.8803 

Fig. 3.21: Classification results 


3.14.6 Exercises for Readers and Practitioners 


Here are further exercises for the reader to consider: 


1. Instead of TFIDF, what other document representations can we try? 
2. How can we incorporate syntactic information to enhance the text clustering 
task? 
3. What semantic representations could be useful for text classification? 
4. What are some other ways to cluster documents? 
5. Can we combine classification models to improve prediction accuracy? 


Basics of Deep Learning 


4.1 Introduction 


One of the most talked-about concepts in machine learning both in the academic 
community and in the media is the evolving field of deep learning. The idea of neural 
networks, and subsequently deep learning, gathers its inspiration from the biological 
representation of the human brain (or any brained creature for that matter). 

The perceptron is loosely inspired by biological neurons (Fig. 4.1), connecting 
multiple inputs (signals to dendrites), combining and accumulating these inputs (as 
would take place in the cell body proper), and producing an output signal that re- 
sembles an axon. 


[Figure: biological neuron with labeled dendrites, cell body, axon, and axon terminals] 


Fig. 4.1: Diagram of a biological neuron 


Neural networks extend this analogy by combining artificial neurons into a 
network in which information is passed between neurons (as across synapses), as 
illustrated in Fig. 4.2. Each of these neurons learns a different function of its input, 
giving the network of neurons an extremely diverse representational power. 

Fig. 4.2: Diagram of an artificial neuron (perceptron) 


The last 6-7 years have seen exponential growth in the popularity and application 
of deep learning. Although the foundations of neural networks can be traced 
back to the late 1960s [Iva68], the AlexNet architecture [KSH12c] ushered in an 
explosion of interest in deep learning when it handily won the 2012 ImageNet 
image classification competition [Den+09b] with a convolutional neural network 
built on five convolutional layers. Since then deep learning has been applied to a multitude of domains and has 
achieved state-of-the-art performance in most of these areas. 

The purpose of this chapter is to introduce the reader to deep learning. By the end 
of this chapter, the reader should be able to understand the basics of neural networks 
and how to train them. We begin this chapter with a review of the perceptron algo- 
rithm that was introduced in Chap. 2, where neural networks found their origin. We 
then introduce the multilayer perceptron (MLP) classifier, the simplest form 
of feed-forward neural networks. Following this is a discussion of the essential com- 
ponents of training an MLP. This section contains an introduction to both forward 
and back propagation and explains the overall training process for neural networks. 
We then move toward an exploration of the essential architectural components: ac- 
tivation functions, error metrics, and optimization methods. After this section, we 
broaden the MLP concept to the deep learning domain, where we introduce addi- 
tional considerations when training deep neural networks, such as computation time 
and regularization. Finally, we conclude with a practical discussion of common deep 
learning framework approaches. 





Fig. 4.3: The step function performs perfectly adequately for the perceptron; how- 
ever, the lack of a non-zero gradient makes it useless for neural networks 


4.2 Perceptron Algorithm Explained 


Deep learning in its simplest form is an evolution of the perceptron algorithm, 
trained with a gradient-based optimizer. Chapter 2 introduced the perceptron algorithm; 
this section revisits it as one 
of the building blocks of deep learning. 

The perceptron algorithm is one of the earliest supervised learning algorithms, 
dating back to the 1950s. Much like a biological neuron, the perceptron 
acts as an artificial neuron: it takes multiple inputs, each with an associated weight, 
and combines them to yield a single output. This is illustrated in Fig. 4.6b. 

The basic form of the perceptron algorithm for binary classification is: 


$$y(x_1, \ldots, x_n) = f(w_1 x_1 + \ldots + w_n x_n). \tag{4.1}$$ 


We individually weigh each $x_i$ by a learned weight $w_i$ to map the input $\mathbf{x} \in \mathbb{R}^n$ 
to an output value $y$, where $f$ is the step function shown below and in 
Fig. 4.3. 


$$f(v) = \begin{cases} 0 & \text{if } v < 0.5 \\ 1 & \text{if } v \geq 0.5 \end{cases} \tag{4.2}$$ 


The step function takes a real number input and yields a binary value: 1, indicating a 
positive classification, if the input reaches the threshold of 0.5, and 0, indicating a 
negative classification, otherwise. 
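Equations 4.1 and 4.2 translate almost directly into code. The sketch below assumes that a value exactly at the threshold counts as positive; the weights are hypothetical values picked so that the perceptron behaves like a logical AND.

```python
def step(v, threshold=0.5):
    """Step function of Eq. 4.2: 1 at or above the threshold, 0 below it."""
    return 1 if v >= threshold else 0

def perceptron_output(weights, inputs):
    """Eq. 4.1: weighted sum of the inputs passed through the step function."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return step(v)

# Hypothetical weights: only the input (1, 1) reaches the 0.5 threshold.
weights = [0.4, 0.3]
outputs = [perceptron_output(weights, x) for x in [(0, 0), (1, 0), (0, 1), (1, 1)]]
print(outputs)
```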


4.2.1 Bias 


The perceptron algorithm learns a hyperplane that separates two classes. However, 
at this point, the separating hyperplane cannot shift away from the origin, as shown 
in Fig. 4.4a. Restricting the hyperplane in this fashion causes issues, as we can see 
in Fig. 4.4b. 
Fig. 4.4: (a) The perceptron algorithm is able to separate the two classes with the 
line passing through the origin. (b) Although the data is linearly separable, the per- 
ceptron algorithm is not able to separate the data. This is due to the restriction of the 
separating plane needing to pass through the origin 


One solution is to normalize the data so that it is centered around the origin. 
Another is to add a 
bias term $b$ to Eq. 4.1, allowing the classification hyperplane to move away from the 
origin, as shown in Fig. 4.5. 
Fig. 4.5: (a) The perceptron algorithm is able to separate the two classes after centering 
the data at the origin. Note the location of the origin in the figure. (b) The 
bias allows the perceptron algorithm to relocate the separating plane, allowing it to 
correctly classify the data points 
We can write the perceptron with a bias term as: 

$$y(x_1, \ldots, x_n) = f(w_1 x_1 + \ldots + w_n x_n + b) \tag{4.3}$$ 


Alternatively, we can treat $b$ as an additional weight $w_0$ tied to a constant input 
of 1 as shown in Fig. 4.6b and write it as: 


$$y(x_1, \ldots, x_n) = f(w_1 x_1 + \ldots + w_n x_n + w_0) \tag{4.4}$$ 







Fig. 4.6: Perceptron classifier diagram. (a) Perceptron classifier diagram drawn 
without the bias. (b) Perceptron diagram including the bias 


Some authors describe this as adding an input constant $x_0 = 1$, allowing the 
learned value for $b = w_0$ to move the decision boundary away from the origin. We 
will continue to write the bias term for now as a reminder of its importance; however, 
the bias term is implicit even when not written, which is commonly the case in 
academic literature. Switching to vector notation, we can rewrite Eq. 4.3 as: 


$$y(\mathbf{x}) = f(\mathbf{w}\mathbf{x} + b). \tag{4.5}$$ 


The bias term is a learned weight that removes the restriction that the separat- 
ing hyperplane must pass through the origin. 


The learning process for the perceptron algorithm is to modify the weights $\mathbf{w}$ to 
achieve 0 error on the training set. For example, suppose we need to separate sets 
of points $A$ and $B$. Starting with random weights $\mathbf{w}$, we incrementally improve the 
boundary through each iteration with the aim of achieving $E(\mathbf{w}, b) = 0$. Thus, we 
minimize the following error function over the entire training set: 


$$E(\mathbf{w}, b) = \sum_{\mathbf{x} \in A} \left(1 - f(\mathbf{w}\mathbf{x} + b)\right) + \sum_{\mathbf{x} \in B} f(\mathbf{w}\mathbf{x} + b) \tag{4.6}$$ 
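The error function of Eq. 4.6 and an iterative weight update can be sketched as follows. The update used here is the classic perceptron rule, which this section does not restate, so treat it as an assumption; the weights also start at zero rather than at random values to keep the example deterministic. The point sets, learning rate, and function names are hypothetical.

```python
def step(v, threshold=0.5):
    return 1 if v >= threshold else 0

def predict(w, b, x):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def error(w, b, set_a, set_b):
    """Eq. 4.6: points of A not classified as 1 plus points of B classified as 1."""
    return (sum(1 - predict(w, b, x) for x in set_a)
            + sum(predict(w, b, x) for x in set_b))

def train(set_a, set_b, lr=0.1, max_epochs=100):
    w, b = [0.0, 0.0], 0.0
    data = [(x, 1) for x in set_a] + [(x, 0) for x in set_b]
    for _ in range(max_epochs):
        if error(w, b, set_a, set_b) == 0:
            break  # every training point is on the correct side
        for x, target in data:
            delta = target - predict(w, b, x)  # 0 when the point is already correct
            w = [wi + lr * delta * xi for wi, xi in zip(w, x)]
            b += lr * delta
    return w, b

# Two linearly separable toy point sets.
set_a = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0)]
set_b = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
w, b = train(set_a, set_b)
final_error = error(w, b, set_a, set_b)
print(w, b, final_error)
```

Because the two sets are linearly separable, the loop stops as soon as the error reaches zero; on data that is not linearly separable it would simply run until `max_epochs`.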
4.2.2 Linear and Non-linear Separability 


Two sets of data are linearly separable if a single linear decision boundary can separate 
them. For example, two sets, $A$ and $B$, are linearly separable if, for some weights $w_i$ and decision 
threshold $t$, every $\mathbf{x} \in A$ satisfies the inequality $\sum_i w_i x_i > t$ and every $\mathbf{y} \in B$ satisfies 
$\sum_i w_i y_i < t$. Conversely, two sets are not linearly separable if separation requires a 
non-linear decision boundary. 

If we apply the perceptron to a non-linearly separable dataset, like the dataset 
shown in Fig. 4.7a, then we are unable to separate the data, as shown in Fig. 4.7b, since 
we are only able to learn three parameters, $w_1$, $w_2$, and $b$. 







Fig. 4.7: (a) Non-linearly separable dataset (generalization of the XOR function). 
(b) Result of training the perceptron algorithm on the non-linearly separable dataset 
in (a). The linear boundary is incapable of classifying the data correctly 


Unfortunately, most data that we tend to encounter in NLP and speech is highly 
non-linear. One option (as we saw in Chap. 2) is to create non-linear combinations 
of the input data and use them as features in the model. Another option is to learn 
non-linear functions of the raw data, which is the principal aim of neural networks. 
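The first option can be illustrated on the XOR function itself. In the sketch below, a brute-force search over a small grid of weights finds no linear perceptron that reproduces XOR from the raw inputs, while adding the hand-crafted product feature $x_1 x_2$ makes the problem solvable with a single perceptron. The grid, the weights, and the helper names are hypothetical choices for illustration.

```python
from itertools import product

XOR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}

def step(v, threshold=0.5):
    return 1 if v >= threshold else 0

def raw(x):
    return (x[0], x[1])

def with_product(x):
    # Adds the non-linear combination x1 * x2 as a third feature.
    return (x[0], x[1], x[0] * x[1])

def solves_xor(weights, bias, features):
    return all(step(sum(w * f for w, f in zip(weights, features(x))) + bias) == y
               for x, y in XOR.items())

# Candidate values -2.0, -1.5, ..., 2.0 for w1, w2, and b.
grid = [i / 2 for i in range(-4, 5)]
linear_solutions = [(w1, w2, b) for w1, w2, b in product(grid, repeat=3)
                    if solves_xor((w1, w2), b, raw)]

# With the product feature, v = x1 + x2 - 2 * x1 * x2 equals XOR exactly.
nonlinear_ok = solves_xor((1.0, 1.0, -2.0), 0.0, with_product)
print(len(linear_solutions), nonlinear_ok)
```

The empty search result is not a coincidence of the grid: the four XOR constraints are contradictory for any linear weights. Neural networks take the second option and learn such non-linear features instead of having them designed by hand.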


4.3 Multilayer Perceptron (Neural Networks) 


The multilayer perceptron (MLP) links multiple perceptrons (commonly referred to 
as neurons) together into a network. Neurons that take the same input are grouped 
into a layer of perceptrons. Instead of using the step function, as seen previously, we 
substitute a differentiable, non-linear function. Applying this non-linear function, 
commonly referred to as an activation function or non-linearity, allows the output 
value to be a non-linear, weighted combination of its inputs, thereby creating non-linear 
features used by the next layer. In contrast, using a linear function as the 
activation function restricts the network to only being able to learn linear transforms 
of the input data. Furthermore, it has been shown that any number of layers with a linear 
activation function can be reduced to a 2-layer MLP [HSW89], because a composition of 
linear maps is itself a single linear map. 
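The collapse of stacked linear layers can be checked numerically. In this sketch two layers with an identity (linear) activation and no bias are applied in sequence, and the result is compared with a single layer whose weight matrix is the product of the two; the matrices and input are hypothetical values.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def matvec(m, v):
    """Product of a matrix and a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [[2.0, 0.0, 1.0]]                      # 3 hidden units -> 1 output
x = [1.0, -2.0]

two_layer = matvec(W2, matvec(W1, x))   # layer by layer
collapsed = matvec(matmul(W2, W1), x)   # one equivalent linear layer
print(two_layer, collapsed)
```

Since $W_2(W_1 \mathbf{x}) = (W_2 W_1)\mathbf{x}$, adding more linear layers never increases what the network can represent; the non-linear activation is what prevents this collapse.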

The MLP is composed of interconnected neurons and is, therefore, a neural network. 
Specifically, it is a feed-forward neural network, since data flows through the network 
in one direction only (there are no cycles or recurrent connections). Figure 4.8 
shows the simplest multilayer perceptron. An MLP must contain an input layer, an output 
layer, and at least one hidden layer. Furthermore, the layers are “fully connected,” 
meaning that each output of a layer is connected to each neuron of the 
next layer. In other words, a weight parameter is learned for each combination of 
input neuron and output neuron between the layers. 







Fig. 4.8: Illustration of the multilayer perceptron network with an input layer, one 
hidden layer containing two neurons, and an output layer. The hidden layer, $\mathbf{h}$, is 
the result of $\mathbf{h} = g(W^{(1)}\mathbf{x})$, where $g(x)$ is the activation function. The output of 
the network is $\hat{y} = f(W^{(2)}\mathbf{h})$, where $f(x)$ is the output function, such as the step or 
sigmoid function 


The hidden layer provides two outputs, $h_1$ and $h_2$, which may be non-linear combinations 
of their input values $x_1$ and $x_2$. The output layer weighs its inputs from the 
hidden layer, now a potentially non-linear mapping, and makes its prediction. 


4.3.1 Training an MLP 


Training the weights of the MLP (and by extension, a neural network) relies on four 
main components. 
Steps to train a neural network: 


1. Forward propagation: Compute the network output for an 
input example. 

2. Error computation: Compute the prediction error between the network 
prediction and the target. 

3. Backpropagation: Compute the gradients in reverse order with respect 
to the input and the weights. 

4. Parameter update: Use stochastic gradient descent to update the weights 
of the network to reduce the error for that example. 


We will walk through each of these components with the network shown in 
Fig. 4.8. 


4.3.2 Forward Propagation 


The first step in training an MLP is to compute the output of the network for an 
example from the dataset. We use the sigmoid function, represented by $\sigma(x)$, as the 
activation function for the MLP. It can be thought of as a smooth step function and 
is illustrated in Fig. 4.14. Additionally, it is continuously differentiable, which is a 
desirable property for backpropagation, as is shown momentarily. The definition of 
the sigmoid function is: 


$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{4.7}$$ 


The forward propagation step is very similar to steps 3 and 4 of the perceptron 
algorithm. The goal of this process is to compute the current network output for 
a particular example x, with each output connected as the input to the next layer’s 
neuron(s). 

For notational and computational convenience, the layer’s weights are combined 
into a single weight matrix, $W_l$, representing the collection of weights in that layer, 
where $l$ is the layer number. The linear transform performed by the layer is an inner 
product computation between $\mathbf{x}$ and $W_l$. This type of layer 
is regularly referred to as a “fully connected,” “inner product,” or “linear” layer because 
a weight connects each input to each output. Computing the prediction $\hat{y}$ for 
an example $\mathbf{x}$, where $\mathbf{h}_1$ and $\mathbf{h}_2$ represent the respective layer outputs, becomes: 


$$\begin{aligned} \mathbf{h}_1 &= f(W_1 \mathbf{x} + \mathbf{b}_1) \\ \mathbf{h}_2 &= f(W_2 \mathbf{h}_1 + b_2) \\ \hat{y} &= \mathbf{h}_2 \end{aligned} \tag{4.8}$$ 

Note the bias $\mathbf{b}_1$ is a vector because there is a bias value associated with each 
neuron in the layer. There is only one neuron in the output layer, so the bias $b_2$ is a 
scalar. 

By the end of the forward propagation step, we have an output prediction for our 
network. Once the network is trained, a new example is evaluated through forward 
propagation. 
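Forward propagation through a network shaped like the one in Fig. 4.8 (two inputs, two hidden neurons, one output) can be sketched as below, using the sigmoid as the activation function of both layers. The weight values are hypothetical, and the scalar output bias is stored as a one-element list purely so that both layers can share the same helper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(W, b, inputs):
    """One fully connected layer: sigmoid(W @ inputs + b), one row of W per neuron."""
    return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + bias)
            for row, bias in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    h1 = layer(W1, b1, x)    # hidden layer output
    h2 = layer(W2, b2, h1)   # output layer (a single neuron)
    return h1, h2[0]

# Hypothetical parameters.
W1 = [[0.5, -0.5], [0.25, 0.75]]
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0]]
b2 = [0.0]

h1, y_hat = forward([0.0, 0.0], W1, b1, W2, b2)
print(h1, y_hat)
```

With an all-zero input every pre-activation is zero, so both hidden neurons output $\sigma(0) = 0.5$; the output neuron then sees $0.5 - 0.5 = 0$ and also returns 0.5.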


4.3.3 Error Computation 


The error computation step measures how well our network performed on the example 
given. We use mean squared error (MSE) as the loss function in this example 
(treating the training as a regression problem). MSE is defined as: 


$$E(\hat{y}, y) = \frac{1}{2n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2. \tag{4.9}$$ 


The factor of $\frac{1}{2}$ simplifies backpropagation. With a single output this quantity is reduced 
to: 


$$E(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^2. \tag{4.10}$$ 


This error function is commonly used for regression problems, measuring the 
average of the squared errors for the target. Squaring forces the error 
to be non-negative and acts as a quadratic loss: differences close to zero 
yield a quadratically smaller error than differences further from zero. 

The error computation step produces a scalar error value for the training example. 
We will talk more about error functions in Sect. 4.4.2. 


[Figure 4.9 diagram labels: $a_2 = w_1 h_1 + w_2 h_2 + b$ and
$\hat{y} = f(a_2) = \sigma(a_2) = \frac{1}{1 + e^{-a_2}}$]

Fig. 4.9: Output neuron of Fig. 4.12 showing the full computation of the pre-
activation and post-activation output

Figure 4.9 shows the forward propagation step and error propagation for the out-
put neuron of Fig. 4.8.


4.3.4 Backpropagation 


During forward propagation, an output prediction $\hat{y}$ is computed for the input $x$ and
the network parameters $\theta$. To improve our prediction, we can use SGD to decrease
the error of the whole network. Determining each parameter's contribution to the
error can be done via the chain rule of calculus, which lets us compute the
derivatives of each layer (and operation) in the reverse order of forward
propagation, as seen in Fig. 4.10.





Fig. 4.10: Visualization of backward propagation 


In our previous example, the prediction $\hat{y}$ was dependent on $W_2$. We can compute
the prediction error with respect to $W_2$ by using the chain rule:

\[ \frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W_2} \tag{4.11} \]

The chain rule allows us to compute the gradient of the error for each of the
learnable parameters $\theta$, allowing us to update the network using stochastic gradient
descent.

We begin by computing the gradient on the output layer with respect to the pre-
diction.

\[ \nabla_{\hat{y}} E(\hat{y}, y) = \frac{\partial E}{\partial \hat{y}} = (\hat{y} - y) \tag{4.12} \]


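We can then compute the error with respect to the layer 2 parameters.
We currently have the “post-activation” gradient, so we need to compute the pre-
activation gradient:

\[
\begin{aligned}
\nabla_{a_2} E = \frac{\partial E}{\partial a_2}
  &= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_2} \\
  &= \frac{\partial E}{\partial \hat{y}} \odot f'(W_2 h_1 + b_2)
\end{aligned}
\tag{4.13}
\]

Now we can compute the error with respect to $W_2$ and $b_2$.

\[
\begin{aligned}
\nabla_{W_2} E = \frac{\partial E}{\partial W_2}
  &= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_2} \cdot \frac{\partial a_2}{\partial W_2} \\
  &= \frac{\partial E}{\partial a_2} h_1^{T}
\end{aligned}
\tag{4.14}
\]

\[
\begin{aligned}
\nabla_{b_2} E = \frac{\partial E}{\partial b_2}
  &= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_2} \cdot \frac{\partial a_2}{\partial b_2} \\
  &= \frac{\partial E}{\partial a_2}
\end{aligned}
\tag{4.15}
\]

We can also compute the error for the input to layer 2 (the post-activation output
of layer 1).

\[
\begin{aligned}
\nabla_{h_1} E = \frac{\partial E}{\partial h_1}
  &= \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a_2} \cdot \frac{\partial a_2}{\partial h_1} \\
  &= W_2^{T} \frac{\partial E}{\partial a_2}
\end{aligned}
\tag{4.16}
\]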


We then repeat this process to calculate the error for layer 1’s parameters $W_1$ and
$b_1$, thus propagating the error backward throughout the network.

Figure 4.11 shows the backward propagation step for the output neuron of the
network shown in Fig. 4.8. We leave numerical exploration and experimentation for
our notebook exercises.

Fig. 4.11: Backpropagation through the output neuron
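The gradients in Eqs. (4.12) to (4.16) can be checked numerically for a single sigmoid output neuron. The sketch below is illustrative only: the weights, bias, hidden activations, and target are hypothetical values, and the function names are not taken from the book's notebooks. It compares the analytic gradient for one weight against a central finite difference of the loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(w, b, h, y):
    """E = 0.5 * (y_hat - y)^2 for one sigmoid output neuron, Eq. (4.10)."""
    a2 = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 0.5 * (sigmoid(a2) - y) ** 2

def output_neuron_grads(w, b, h, y):
    """Analytic gradients following Eqs. (4.12)-(4.16)."""
    a2 = sum(wi * hi for wi, hi in zip(w, h)) + b
    y_hat = sigmoid(a2)
    dE_dyhat = y_hat - y                        # Eq. (4.12)
    dE_da2 = dE_dyhat * y_hat * (1.0 - y_hat)   # Eq. (4.13), sigma' = sigma * (1 - sigma)
    dE_dw = [dE_da2 * hi for hi in h]           # Eq. (4.14)
    dE_db = dE_da2                              # Eq. (4.15)
    dE_dh = [wi * dE_da2 for wi in w]           # Eq. (4.16)
    return dE_dw, dE_db, dE_dh

# Hypothetical values for illustration only.
w, b, h, y = [0.4, -0.7], 0.1, [0.6, 0.3], 1.0
dE_dw, dE_db, dE_dh = output_neuron_grads(w, b, h, y)

# Central finite difference for the first weight as a sanity check.
eps = 1e-6
numeric = (loss([w[0] + eps, w[1]], b, h, y)
           - loss([w[0] - eps, w[1]], b, h, y)) / (2 * eps)
print(dE_dw[0], numeric)
```

A finite-difference check of this kind is a common way to catch sign or indexing mistakes in a hand-written backward pass.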


4.3.5 Parameter Update 


The last step in the training process is the parameter update. After obtaining the
gradients with respect to all learnable parameters in the network, we can complete
a single SGD step, updating the parameters for each layer according to the learning
rate $\alpha$.

\[ \theta = \theta - \alpha \nabla_{\theta} E \tag{4.17} \]

The simplicity of the SGD update rule presented here does come at a cost. The
value of $\alpha$ is particularly vital in SGD and affects the speed of convergence, the
quality of convergence, and even the ability of the network to converge at all. If the
learning rate is too small, the network converges very slowly and can potentially
get stuck in local minima near the random weight initialization. If the learning rate
is too large, the weights may grow too quickly, becoming unstable and failing to
converge at all. Furthermore, the selection of the learning rate depends on a combi-
nation of factors such as network depth and normalization method. The simplicity of
the network presented here alleviates the tedious nature of selecting a learning rate,
but for deeper networks, this process can be much more difficult. The importance
of choosing a good learning rate has led to an entire area of research around gradi-
ent descent optimization algorithms. We discuss some of these techniques more in
Sect. 4.4.3.
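The effect of the learning rate can be seen on a toy problem. The sketch below applies Eq. (4.17) to the hypothetical one-parameter objective $E(\theta) = \theta^2$, whose gradient is $2\theta$. The three learning rates are arbitrary choices meant to show slow convergence, fast convergence, and divergence.

```python
def gradient_descent(alpha, theta=1.0, steps=20):
    """Run Eq. (4.17) on E(theta) = theta^2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta
    return theta

for alpha in (0.01, 0.4, 1.1):
    print(alpha, gradient_descent(alpha))
```

Each step multiplies $\theta$ by $(1 - 2\alpha)$, so the iterate shrinks slowly for $\alpha = 0.01$, shrinks quickly for $\alpha = 0.4$, and grows in magnitude for $\alpha = 1.1$.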
The overall process is described in Algorithm 1.

Algorithm 1: Neural network training

Data: Training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
      Neural network with $l$ layers with learnable parameters
      $\theta = (\{W_1, \ldots, W_l\}, \{b_1, \ldots, b_l\})$
      Activation function $f(v)$
      Learning rate $\alpha$
      Error function $E(\hat{y}, y)$
Initialize neural network parameters $\theta = (\{W_1, \ldots, W_l\}, \{b_1, \ldots, b_l\})$
for $e \leftarrow 1$ to epochs do
    for $(x, y)$ in $D$ do
        for $i \leftarrow 1$ to $l$ do
            if $i = 1$ then
                $h_0 = x$
            $a_i = W_i h_{i-1} + b_i$
            $h_i = f(a_i)$
        $\hat{y} = h_l$
        error $= E(\hat{y}, y)$
        $g_{h_l} = \nabla_{\hat{y}} E(\hat{y}, y)$
        for $i \leftarrow l$ to $1$ do
            $g_{a_i} = \nabla_{a_i} E = g_{h_i} \odot f'(a_i)$
            $\nabla_{W_i} E = g_{a_i} h_{i-1}^{T}$
            $\nabla_{b_i} E = g_{a_i}$
            $g_{h_{i-1}} = \nabla_{h_{i-1}} E = W_i^{T} g_{a_i}$
        $\theta = \theta - \alpha \nabla_{\theta} E$
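Algorithm 1 can be sketched end to end in plain Python. The version below is a minimal illustration rather than the book's notebook code: it assumes sigmoid activations in every layer, the single-output loss of Eq. (4.10), a toy dataset (logical OR), and fixed hypothetical initial weights so that the run is repeatable. Parameters are updated layer by layer during the backward loop, which gives the same result as a single update afterwards because each layer's input gradient is computed before its weights change.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(params, x):
    """Return the activations h_0 .. h_l for input x."""
    hs = [x]
    for W, b in params:
        a = [sum(w * h for w, h in zip(row, hs[-1])) + bi for row, bi in zip(W, b)]
        hs.append([sigmoid(ai) for ai in a])
    return hs

def sgd_step(params, x, y, alpha):
    """Forward pass, backward pass, and update for one training example."""
    hs = forward(params, x)
    g_h = [hs[-1][0] - y]                        # gradient of 0.5 * (y_hat - y)^2
    for i in range(len(params) - 1, -1, -1):
        W, b = params[i]
        h_out, h_in = hs[i + 1], hs[i]
        # For the sigmoid f'(a) = h * (1 - h), so g_a = g_h * f'(a) element-wise.
        g_a = [g * h * (1.0 - h) for g, h in zip(g_h, h_out)]
        # Gradient for the layer input, computed before W is modified.
        g_h = [sum(W[j][k] * g_a[j] for j in range(len(W))) for k in range(len(h_in))]
        for j, row in enumerate(W):
            for k in range(len(row)):
                row[k] -= alpha * g_a[j] * h_in[k]   # grad_W = g_a h_in^T
            b[j] -= alpha * g_a[j]                   # grad_b = g_a

def mean_loss(params, data):
    return sum(0.5 * (forward(params, x)[-1][0] - y) ** 2 for x, y in data) / len(data)

# Hypothetical toy problem (logical OR) and fixed initial weights.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
params = [([[0.3, -0.2], [-0.1, 0.4]], [0.0, 0.0]),   # layer 1: W_1, b_1
          ([[0.2, -0.3]], [0.0])]                      # layer 2: W_2, b_2

loss_before = mean_loss(params, data)
for epoch in range(500):
    for x, y in data:
        sgd_step(params, x, y, alpha=0.5)
loss_after = mean_loss(params, data)
print(loss_before, loss_after)
```

Printing the loss before and after training gives a quick indication of whether the forward pass, backward pass, and update are wired together correctly.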


4.3.6 Universal Approximation Theorem 


Neural network architectures are applied to a variety of problems because of their
representational power. The universal approximation theorem [HSW89] has shown
that a feed-forward neural network with a single hidden layer can approximate any
continuous function with only limited restrictions on the number of neurons in the
layer.¹ This theorem often gets summarized as “neural networks are universal
approximators.” Although this is technically true, the theorem does not provide any
guarantees on the likelihood of learning a particular function.


¹ The universal approximation theorem was initially proved for neural network architectures
using the sigmoid activation function, but was subsequently shown to apply to all fully connected
networks [Cyb89b, HSW89].




The topography of the parameter space becomes more varied as machine learning 
problems become more complex. It is typically non-convex with many local min- 
ima. A simple gradient descent approach may struggle to learn the specific function. 
Instead, multiple layers of neurons are stacked consecutively and trained jointly with 
backpropagation. The network of layers then learns multiple non-linear functions to 
fit the training dataset. Deep learning refers to many neural network layers con- 
nected in sequence. 


4.4 Deep Learning 


The term “deep learning” is somewhat ambiguous. In many circles deep learning is 
a re-branding term for neural networks or is used to refer to neural networks with 
many consecutive (deep) layers. However, the number of layers to distinguish a deep 
network from a shallow network is relative. For example, would the neural network 
shown in Fig. 4.12 be considered deep or shallow? 
Fig. 4.12: Feed-forward neural network with two hidden layers


In general, deep networks are still neural networks (trained with backpropaga-
tion, learning hierarchical abstractions of the input, optimized using gradient-based
learning), but typically with more layers. The distinguishing characteristic of deep
learning is its application to problems previously infeasible for traditional methods
and smaller neural networks, such as the MLP shown in Fig. 4.8. Deeper networks
allow for more layers of hierarchical abstractions to be learned for the input data,
thus becoming capable of learning higher-order functions in more complex domains.
For this book, however, we use the term deep learning as described above: a neural
network with more than one hidden layer.

The flexibility of neural networks is what makes them so compelling. Neural 
networks are applied to many types of problems given the simplicity and effective- 
ness of backpropagation and gradient-based optimization methods. In this section, 
we introduce additional methods and considerations that impact the architecture de- 
sign and model training for deep neural networks (DNN). In particular, we focus 
on activation functions, error functions, optimization methods, and regularization 
approaches. 


4.4.1 Activation Functions 
Fig. 4.13: The step function performed perfectly adequately for the perceptron;
however, its derivative makes it a poor fit for gradient descent methods


When computing the gradient of the output layer, it becomes apparent that the
step function is not helpful when trying to compute a gradient. As shown
in Fig. 4.13, the derivative is 0 everywhere except at the step itself, where it is
undefined, which means gradient descent receives no useful signal. Therefore we wish
to use a non-linear activation function that provides a meaningful derivative in the
backpropagation process.
4.4.1.1 Sigmoid


A better function to use as an activation function is the logistic sigmoid:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \tag{4.18} \]


Fig. 4.14: Sigmoid activation function and its derivative

The sigmoid function is a useful activation for a variety of reasons. As we can
see from the graph in Fig. 4.14, this function acts as a continuous squashing function
that bounds its output in the range (0, 1). It is similar to the step function but has a
smooth, continuous derivative ideal for gradient descent methods. Its midpoint lies at
$x = 0$, where $\sigma(0) = 0.5$, creating a simple decision boundary for binary
classification tasks, and the derivative of the sigmoid function is mathematically
convenient:

\[ \sigma'(x) = \sigma(x)(1 - \sigma(x)). \tag{4.19} \]


There are, however, some undesirable properties of the sigmoid function. 


• Saturation of the sigmoid gradients at the ends of the curve (where $\sigma(x) \to 0$
  or $\sigma(x) \to 1$) will cause the gradients to be very close to 0. As backpropagation
  continues for subsequent layers, the small gradient is multiplied by the post-
  activation output of the previous layer, forcing it smaller still. Preventing this
  can require careful initialization of the network weights or other regularization
  strategies.

• The outputs of the sigmoid are not centered around 0, but instead around 0.5.
  This introduces a discrepancy between the layers because the outputs are not in a
  consistent range. This is often referred to as “internal covariate shift,” which we
  will talk more about later.
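The saturation behavior is easy to see numerically. The following sketch evaluates Eq. (4.19) at a few hand-picked inputs and then computes an upper bound on the gradient factor contributed by five stacked sigmoid activations. The depth of five and the omission of the weights are simplifying assumptions made only for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Eq. (4.19): sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))

# Upper bound on the gradient factor contributed by 5 stacked sigmoids.
# (Weights are ignored here; this is an illustration, not a full model.)
max_factor = sigmoid_grad(0.0) ** 5
print(max_factor)
```

The derivative peaks at 0.25 for $x = 0$ and falls off quickly as $|x|$ grows, so even in the best case each sigmoid layer scales the backpropagated gradient by at most one quarter.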
4.4.1.2 Tanh


The tanh function is another common activation function. It also acts as a squashing
function, bounding its output in the range (−1, 1) as shown in Fig. 4.15.

\[ f(x) = \tanh(x) \tag{4.20} \]

It can also be viewed as a scaled and shifted sigmoid.

\[ \tanh(x) = 2\sigma(2x) - 1 \tag{4.21} \]


The tanh function solves one of the issues with the sigmoid non-linearity because
it is zero-centered. However, we still have the same issue with the gradient saturation
at the extremes of the function, shown in Fig. 4.15.

Fig. 4.15: Tanh activation function and its derivative
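The relationship in Eq. (4.21) and the saturation of the tanh gradient can both be verified with a short check. The sample points below are arbitrary, and the derivative $1 - \tanh^2(x)$ is the standard identity rather than a formula stated in this section.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """Eq. (4.21): tanh(x) = 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Largest disagreement between the library tanh and the sigmoid form.
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x))
              for x in (-3.0, -0.5, 0.0, 0.5, 3.0))
print(max_gap, tanh_grad(0.0), tanh_grad(5.0))
```

The gradient equals 1 at the origin, four times the sigmoid's peak, but it still decays toward 0 for large $|x|$.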


4.4.1.3 ReLU

Fig. 4.16: ReLU activation function and its derivative

The rectified linear unit (ReLU) is a simple, fast activation function typically
found in computer vision. The function is a linear threshold, defined as:

\[ f(x) = \max(0, x). \tag{4.22} \]


This simple function has become popular because it has shown faster conver- 
gence compared to sigmoid and tanh, possibly due to its non-saturating gradient in 
the positive direction. 

In addition to faster convergence, the ReLU function is much faster computation- 
ally. The sigmoid and tanh functions require exponentials which take much longer 
than a simple max operation. 

One drawback of the gradient being either 0 or 1 is that it can lead to neurons
“dying” during training. If a large gradient is backpropagated through a neuron, the
resulting update can shift the neuron’s weights so far that its output is 0 for every
input. Its gradient is then also 0, which prevents the neuron from ever updating
again. Some have shown that as many as 40% of the neurons in a network can “die”
with the ReLU activation function if the learning rate is set too high.


4.4.1.4 Other Activation Functions 


Other activation functions have been introduced to limit the drawbacks of those pre-
viously described. Several are displayed in Fig. 4.17.


• Hard tanh
  The hard tanh function is computationally cheaper than the tanh. It does, how-
  ever, re-introduce the disadvantage of gradient saturation at the extremes.

  \[ f(x) = \max(-1, \min(1, x)) \tag{4.23} \]

• Leaky ReLU
  The Leaky ReLU introduces an $\alpha$ parameter that allows small gradients to be
  backpropagated when the unit is not active, thus eliminating the “death” of
  neurons during training.

  \[ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases} \tag{4.24} \]

• PReLU
  The parametric rectified linear unit, similar to the Leaky ReLU, uses an $\alpha$ pa-
  rameter to scale the slope of the negative portion of the input; however, an $\alpha$
  parameter is learned for each neuron (adding one learned parameter per neuron).
  Note that when $\alpha = 0$ this is the ReLU function, and when $\alpha$ is
  fixed, it is equivalent to the Leaky ReLU.

  \[ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases} \tag{4.25} \]

• ELU
  The ELU is a modification of the ReLU that allows the mean of activations to
  push closer to 0, which therefore potentially speeds up convergence.

  \[ f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^{x} - 1) & \text{if } x \le 0 \end{cases} \tag{4.26} \]

• Maxout
  The maxout function takes a different approach to activation functions. It differs
  from the element-wise application of a function to each neuron output. Instead,
  it learns two weight matrices and takes the highest output for each element.

  \[ f(x) = \max(w_1 x + b_1, w_2 x + b_2) \tag{4.27} \]

Fig. 4.17: Additional activation functions. (a) Hard tanh. (b) Leaky ReLU. (c)
PReLU. (d) ELU
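The piecewise definitions in Eqs. (4.22) to (4.27) translate directly into scalar Python functions. This is a sketch for illustration: the default slope `alpha=0.01` for the Leaky ReLU and `alpha=1.0` for the ELU are assumed values, not ones given in the text, and `maxout` is written for a single unit with two linear pieces. A PReLU has the same form as `leaky_relu` with `alpha` treated as a learned parameter.

```python
import math

def relu(x):
    return max(0.0, x)                                   # Eq. (4.22)

def hard_tanh(x):
    return max(-1.0, min(1.0, x))                        # Eq. (4.23)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x                     # Eq. (4.24), alpha fixed

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)   # Eq. (4.26)

def maxout(x, w1, b1, w2, b2):
    """Eq. (4.27) for a single unit with two linear pieces."""
    z1 = sum(w * xi for w, xi in zip(w1, x)) + b1
    z2 = sum(w * xi for w, xi in zip(w2, x)) + b2
    return max(z1, z2)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), hard_tanh(x), leaky_relu(x), elu(x))
```

With the weights `w1 = [1.0]`, `w2 = [-1.0]` and zero biases, `maxout` reduces to the absolute value function, which shows how two linear pieces can form a non-linear activation.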
4.4.1.5 Softmax


The squashing concept of the sigmoid function is extended to multiple classes by
way of the softmax function. The softmax function allows us to output a categorical
probability distribution over $K$ classes.

\[ f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}} \tag{4.28} \]


We can use the softmax to produce a vector of probabilities from the outputs of
the final layer. In the case of a classification problem that has $K = 3$ classes,
the final layer of our network will be a fully connected layer with an output of
three neurons. If we apply the softmax function to the output of the last layer, we
get a probability for each class by assigning a class to each neuron. The softmax
computation is shown in Fig. 4.18.

The softmax probabilities can become very small, especially when there are
many classes and the predictions become more confident. Most of the time a log-
based softmax function is used to avoid underflow errors. The softmax function is a
particular case among activation functions, in that it is rarely seen as an activation that
occurs between layers. Therefore, the softmax is often treated as the last layer of a
network for multiclass classification rather than an activation function.

Fig. 4.18: The output of a neural network can be mapped to a multi-class classi-
fication task (three classes shown here). The softmax function maps the real-valued
network output to a probability distribution over the number of classes, where the
number of classes equals the number of neurons in the final layer
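A minimal sketch of Eq. (4.28) and of a log-based variant follows. Subtracting the maximum logit before exponentiating is a widely used stabilization that leaves the result unchanged; it is an implementation choice assumed here, not something prescribed by the text. The logit values are hypothetical.

```python
import math

def softmax(logits):
    """Eq. (4.28), shifted by max(logits) so exp() cannot overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """Logarithm of Eq. (4.28), computed with the log-sum-exp trick."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]

probs = softmax([2.0, 1.0, 0.1])          # K = 3 classes
print(probs, sum(probs))

# The probabilities of the last two classes underflow to 0.0 here,
# but their log-probabilities stay finite.
extreme = [0.0, -1000.0, -2000.0]
print(softmax(extreme))
log_probs = log_softmax(extreme)
print(log_probs)
```

Working in log space keeps very unlikely classes distinguishable, which matters when the loss takes the logarithm of the predicted probability.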


4.4.1.6 Hierarchical Softmax 


As the number of classes begins to grow, as is often the case in language tasks,
the softmax function can become expensive to compute. For
example, in a language modeling task, our output layer may be trying to predict
which word will be next in the sequence. Therefore, the output of the network would
be a probability distribution over the number of terms in our vocabulary, which
could be thousands or hundreds of thousands. The Hierarchical Softmax [MB05]
approximates the softmax function by representing the function as a binary tree
with the depth yielding less probable class activations. The tree must be balanced as
the network is training, but it will have a depth of $\log_2(K)$ where $K$ is the number
of classes, which means only $\log_2(K)$ states need to be evaluated to compute the
output probability of a class.
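The tree-based idea can be sketched for $K = 4$ classes. In this hedged illustration each internal node makes a binary decision with a sigmoid, and a class probability is the product of the decisions along its path. The tree layout, the `PATHS` table, and the node scores are hypothetical; in a trained model each score would typically come from the hidden state and learned node parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Balanced binary tree over K = 4 classes. Each class is a leaf reached by a
# path of (internal_node, go_left) decisions: log2(4) = 2 decisions per class.
PATHS = {
    0: [(0, True), (1, True)],
    1: [(0, True), (1, False)],
    2: [(0, False), (2, True)],
    3: [(0, False), (2, False)],
}

def class_probability(cls, node_scores):
    """Multiply the binary decisions along the path from the root to the leaf."""
    p = 1.0
    for node, go_left in PATHS[cls]:
        p_left = sigmoid(node_scores[node])
        p *= p_left if go_left else (1.0 - p_left)
    return p

# Hypothetical scores for the three internal nodes.
node_scores = [0.8, -0.3, 1.5]
probs = [class_probability(c, node_scores) for c in range(4)]
print(probs, sum(probs))
```

Only two sigmoid evaluations are needed per class here instead of a normalization over all four outputs, and the leaf probabilities still sum to one.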


4.4.2 Loss Functions 


Another important aspect of training neural networks is the choice of error function,
often referred to as the criterion. The selection of the error function depends
on the type of problem being addressed. For a classification problem, we want to
predict a probability distribution over a set of classes. In regression problems, how-
ever, we want to predict a specific value rather than a distribution. We present the
basic, most commonly used loss functions here.


4.4.2.1 Mean Squared ($L_2$) Error


Mean squared error (MSE) computes the squared error between the
prediction and the target. Training with it minimizes the difference in magnitude.
One drawback to MSE is that it is susceptible to outliers since the difference is
squared.

\[ E(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{4.29} \]

So far, we have been using the MSE or $L_2$ loss for its simplicity as the loss for
binary classification problems, classifying an example as 1 if $\hat{y} \ge 0.5$ or 0 if
$\hat{y} < 0.5$; however, it is typically used for regression problems and could be easily
extended for the simple problems that we have been working with.


4.4.2.2 Mean Absolute ($L_1$) Error


Mean absolute error gives a measure of the absolute difference between the target
value and prediction. Using it minimizes the magnitude of the error without consid-
ering direction, making it less sensitive to outliers.

\[ E(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| \tag{4.30} \]
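The difference in outlier sensitivity between Eqs. (4.29) and (4.30) shows up clearly on a small example. The targets and predictions below are hypothetical; the last prediction is deliberately far from its target.

```python
def mse(y_hat, y):
    """Eq. (4.29): mean of squared differences."""
    return sum((t - p) ** 2 for p, t in zip(y_hat, y)) / len(y)

def mae(y_hat, y):
    """Eq. (4.30): mean of absolute differences."""
    return sum(abs(t - p) for p, t in zip(y_hat, y)) / len(y)

# Hypothetical targets and predictions; the last prediction is an outlier.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 2.5, 3.5, 14.0]

print(mse(y_hat, y))   # 25.1875, dominated by the single large error
print(mae(y_hat, y))   # 2.875
```

The single error of 10 contributes 100 to the squared sum but only 10 to the absolute sum, which is why the MSE is pulled much further by the outlier.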
4.4.2.3 Negative Log Likelihood


Negative log likelihood (NLL) is the most common loss function used for multi-
class classification problems. It is also known as the multiclass cross-entropy loss.
The softmax provides a probability distribution over the output classes. The entropy
computation is a weighted-average log probability over the possible events or clas-
sifications in a multiclass classification problem. This causes the loss to increase as
the probability distribution of the prediction diverges from the target label.

\[ E(\hat{y}, y) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right] \tag{4.31} \]

4.4.2.4 Hinge Loss 


The hinge loss is a max-margin classification loss taken from the SVM. It at-
tempts to separate data points between classes by maximizing the margin between
them. Although it is not differentiable everywhere (it has a kink where the margin
equals 1), it is convex, which makes it useful to work with as a loss function.

\[ E(\hat{y}, y) = \sum_{i=1}^{n} \max(0, 1 - y_i \hat{y}_i) \tag{4.32} \]


4.4.2.5 Kullback—Leibler (KL) Loss 


Additionally, we can optimize on functions such as the KL-divergence, which mea-
sures how much one probability distribution differs from another (it is not a true
distance metric, because it is not symmetric). This is useful for problems like gen-
erative networks with continuous output distributions. The KL-divergence error can
be described by:

\[
\begin{aligned}
E(\hat{y}, y) &= \frac{1}{n} \sum_{i=1}^{n} D_{KL}(y_i \,\|\, \hat{y}_i) \\
  &= \frac{1}{n} \sum_{i=1}^{n} \left( y_i \cdot \log(y_i) \right) - \frac{1}{n} \sum_{i=1}^{n} \left( y_i \cdot \log(\hat{y}_i) \right)
\end{aligned}
\tag{4.33}
\]
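The three classification-oriented losses in Eqs. (4.31) to (4.33) can be sketched as follows. A few assumptions are made for the example: the cross-entropy is written in its two-class form with labels in {0, 1}, the hinge loss expects labels in {−1, +1} and raw scores, and the KL-divergence is computed between two discrete distributions without the outer averaging of Eq. (4.33). All input values are hypothetical.

```python
import math

def binary_cross_entropy(y_hat, y):
    """Eq. (4.31) with y_i in {0, 1} and y_hat_i strictly inside (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(y_hat, y)) / len(y)

def hinge(scores, y):
    """Eq. (4.32) with labels y_i in {-1, +1} and raw scores."""
    return sum(max(0.0, 1.0 - t * s) for s, t in zip(scores, y))

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions; 0 * log(0) is treated as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

good = binary_cross_entropy([0.9, 0.1], [1, 0])   # confident and correct
bad = binary_cross_entropy([0.1, 0.9], [1, 0])    # confident and wrong
print(good, bad)

print(hinge([2.0, 0.5, -1.0], [1, 1, 1]))         # 0 + 0.5 + 2 = 2.5
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

The hinge loss is zero for the first example because its score already exceeds the margin of 1, while the cross-entropy keeps rewarding higher confidence in the correct class.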


4.4.3 Optimization Methods 


The training process of neural networks is based on gradient descent methods,
specifically SGD. However, as we have seen in the previous section, SGD can cause
many difficulties during the training process. Here we explore optimization methods
beyond SGD and the benefits associated with them. We denote all learnable
parameters, including weights and biases, as $\theta$.
4.4.3.1 Stochastic Gradient Descent


As presented in Chap. 2, stochastic gradient descent is the process of making updates
to a set of weights in the direction opposite to the gradient to reduce the error. In
Algorithm 1, SGD’s update rule was the simple form:

\[ \theta_{t+1} = \theta_t - \alpha \nabla_{\theta} E, \tag{4.34} \]

where $\theta$ represents the learnable parameters, $\alpha$ is the learning rate, and
$\nabla_{\theta} E$ is the gradient of the error with respect to the parameters.


4.4.3.2 Momentum 


One issue that commonly arises with SGD is that there are areas of the parameter
space that have long shallow ravines leading up to the minima. SGD will oscillate
back and forth across the ravine because the gradient will point down the steepest
slope on one of the sides rather than in the direction of the minima. Thus, SGD can
yield slow convergence.

Momentum is one modification of SGD that moves the parameters more quickly
toward the minima. The parameter update equation for momentum is

\[
\begin{aligned}
v_t &= \gamma v_{t-1} + \eta \nabla_{\theta} E \\
\theta_{t+1} &= \theta_t - v_t
\end{aligned}
\tag{4.35}
\]

where $\theta_t$ represents a parameter at iteration $t$ and $\eta$ is the learning rate.

Momentum, taking its inspiration from physics, computes a velocity vector cap-
turing the cumulative direction that previous gradients have yielded. This velocity
vector is scaled by an additional hyper-parameter $\gamma$, which controls how heavily
the cumulative velocity can contribute to the update.
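The momentum update can be sketched on a toy ravine-shaped objective. In this illustration the objective $E = \frac{1}{2}(\theta_0^2 + 25\,\theta_1^2)$, the starting point, the learning rate of 0.02, and the momentum coefficient of 0.9 are all assumed values chosen so that the behavior is easy to follow.

```python
def momentum_step(theta, velocity, grad, lr=0.1, gamma=0.9):
    """v_t = gamma * v_{t-1} + lr * grad, then theta_{t+1} = theta_t - v_t."""
    velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    theta = [t - v for t, v in zip(theta, velocity)]
    return theta, velocity

def ravine_grad(theta):
    """Gradient of the hypothetical ravine E = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

theta, velocity = [5.0, 1.0], [0.0, 0.0]
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, ravine_grad(theta), lr=0.02)
print(theta)
```

The velocity accumulates along the shallow direction, where successive gradients agree, while oscillating contributions along the steep direction partly cancel.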


4.4.3.3 Adagrad 


Adagrad [DHS11] is an adaptive gradient-based optimization method. It adapts the
learning rate to each of the parameters in the network, making more substantial up-
dates to infrequent parameters, and smaller updates to frequent ones. This makes
it particularly useful for learning problems with sparse data [PSM14]. Perhaps the
most significant benefit of Adagrad is that it removes the need to tune the learning
rate manually. This does, however, come at the cost of having an additional param-
eter for every parameter in the network.

The Adagrad equation is given by:

\[
\begin{aligned}
g_{t,i} &= \nabla_{\theta} E(\theta_{t,i}) \\
\theta_{t+1,i} &= \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}
\end{aligned}
\tag{4.36}
\]

where $g_t$ is the gradient at time $t$ along each component of $\theta$, $G_t$ is the diagonal
matrix whose diagonal entries hold the sum of the squares of the past gradients, up
to time step $t$, with respect to each parameter in $\theta$, $\eta$ is the general learning
rate, and $\epsilon$ is a smoothing term (usually 1e−8) that keeps the equation from
dividing by zero.

The main drawback to Adagrad is that the accumulated squared gradients are
always positive, causing the sum to keep growing, shrinking the learning rate, and
eventually stopping the model from further learning. Additional variants, such as
Adadelta [Zei12], have been introduced to alleviate this problem.
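A short sketch of the Adagrad update in Eq. (4.36) makes the per-parameter behavior concrete. The gradient sequence is hypothetical: the first parameter receives a gradient at every step, while the second receives one only at the last step, mimicking a rarely seen feature. The base learning rate of 0.1 is an assumed value.

```python
import math

def adagrad_step(theta, grad_sq_sum, grad, lr=0.1, eps=1e-8):
    """Eq. (4.36): per-parameter step size lr / sqrt(G_ii + eps)."""
    grad_sq_sum = [s + g * g for s, g in zip(grad_sq_sum, grad)]
    theta = [t - lr / math.sqrt(s + eps) * g
             for t, s, g in zip(theta, grad_sq_sum, grad)]
    return theta, grad_sq_sum

theta, G = [0.0, 0.0], [0.0, 0.0]
steps = []
for grad in ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]):
    new_theta, G = adagrad_step(theta, G, grad)
    steps.append([abs(n - t) for n, t in zip(new_theta, theta)])
    theta = new_theta

for s in steps:
    print(s)
```

On the last step the frequently updated parameter moves by about 0.05 while the rarely updated one moves by about 0.1, even though both receive the same gradient. The steps for the first parameter also shrink over time, which is the drawback described above.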


4.4.3.4 RMS-Prop 


RMS-prop [TH12], developed by Hinton, was also introduced to solve the inadequa-
cies of Adagrad. It also divides the learning rate by an average of squared gradients,
but it decays this quantity exponentially.

\[
\begin{aligned}
E[g^2]_t &= \rho E[g^2]_{t-1} + (1 - \rho) g_t^2 \\
\theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \, g_t
\end{aligned}
\tag{4.37}
\]

where $\rho = 0.9$ and the learning rate $\eta = 0.001$ are suggested in the presented lecture.
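The exponentially decayed average in Eq. (4.37) can be sketched as below, using the suggested $\rho = 0.9$ and $\eta = 0.001$. The constant gradient of 2.0 and the value of `eps` are assumptions made to keep the example simple.

```python
import math

def rmsprop_step(theta, avg_sq, grad, lr=0.001, rho=0.9, eps=1e-8):
    """Eq. (4.37) applied element-wise."""
    avg_sq = [rho * a + (1.0 - rho) * g * g for a, g in zip(avg_sq, grad)]
    theta = [t - lr / math.sqrt(a + eps) * g
             for t, a, g in zip(theta, avg_sq, grad)]
    return theta, avg_sq

# With a constant gradient the running average approaches g^2, so the step size
# approaches lr instead of shrinking toward zero as in Adagrad.
theta, avg_sq = [0.0], [0.0]
for _ in range(200):
    prev = theta[0]
    theta, avg_sq = rmsprop_step(theta, avg_sq, [2.0])
last_step = abs(theta[0] - prev)
print(avg_sq[0], last_step)
```

Because old squared gradients are forgotten at rate $\rho$, the denominator stabilizes and learning can continue indefinitely.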


4.4.3.5 ADAM 


Adaptive moment estimation, referred to as Adam [KB14], is another adaptive opti-
mization method. It too computes learning rates for each parameter, but in addition
to keeping an exponentially decaying average of the previous squared gradients $v_t$,
it also incorporates, similar to momentum, an average of past gradients $m_t$.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
\end{aligned}
\]

Empirical results show that Adam works well in practice in comparison with
other gradient-based optimization techniques.

While Adam has been a popular technique, some criticisms of the original 
proof have surfaced showing convergence to sub-optimal minima in some situa- 
tions [BGW18, RKK18]. Each work proposes a solution to the issue, however the 
subsequent methods remain less popular than the original Adam technique. 
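A minimal sketch of an Adam step for a single scalar parameter follows. The moving averages $m_t$ and $v_t$ match the equations above; the bias-correction terms and the final update rule are not shown in this section and are taken here from the commonly used form of the algorithm in [KB14], so they should be read as an assumption. The objective $E(\theta) = (\theta - 3)^2$ and the learning rate of 0.01 are hypothetical.

```python
import math

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t counts steps starting at 1."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment average m_t
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment average v_t
    m_hat = m / (1.0 - beta1 ** t)                # bias correction (assumed form)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the hypothetical objective E(theta) = (theta - 3)^2.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, m, v, grad, t, lr=0.01)
print(theta)
```

The first update has magnitude close to the learning rate regardless of the gradient's scale, since the corrected ratio of the two averages is approximately the sign of the gradient.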


4.5 Model Training 


Achieving the best generalization error (best performance on the test set) is the main
goal for machine learning, which requires finding the best position on the spectrum
between overfitting and underfitting. Deep learning is more prone to overfitting.
With many free parameters, it can be relatively easy to find a path to achieve $E = 0$.
It has been shown that many standard deep learning architectures can be trained on
random labeling of the training data and achieve $E = 0$ [Zha+16].

In contrast to overfitting, for many complex functions there are diverse local
minima that may not be the optimal solution, and it is common to settle in a local
minimum. Deep learning relies on finding a solution to a non-convex optimization
problem, which is NP-complete for a general non-convex function [MK87]. In prac-
tice, we see that computing the global minimum for a well-regularized deep network
is mostly irrelevant because local minima are usually roughly similar and get closer
to the global minimum as the complexity of the model increases [Cho+15a]. In a
poorly regularized network, however, the local minima may yield a high loss, which
is undesirable.

The best model is one that achieves the smallest gap between its training loss 
and validation loss; however, selecting the correct architecture configuration and 
training technique can be taxing. Here we discuss typical training and regularization 
techniques to improve model generalization. 


4.5.1 Early Stopping 


One of the more practical ways that we can prevent a model from overfitting is “early 
stopping.” Early stopping hinges on the assumption: “As validation error decreases, 
test error should also decrease.” When training we compute the validation error at 
distinct points (usually at the end of each epoch) and keep the model with the lowest 
validation error, as shown in Fig. 4.19. 
The learning curve shows that the training error will continue to decrease towards
0. However, the model begins to perform worse on the validation set as it overfits
to the training data. Therefore, to maintain the generalization of the model on the
test set, the model (that is, its learned parameters) that performed best on our vali-
dation set would be selected. It is also important to point out here that this requires
a dataset that is split into training, validation, and testing sets with no overlap. The
test set should be kept separate from the training and validation sets; otherwise, the
integrity of the evaluation is compromised.

The simplicity of early stopping makes it the most commonly used form of reg- 
ularization in deep learning. 
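The selection rule can be sketched as a loop over per-epoch validation errors. The `patience` parameter, which stops training after a number of epochs without improvement, is a common practical addition assumed here rather than something described in the text. The validation errors are hypothetical.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error), stopping once the validation error
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_error = None, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err   # a model checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_error

# Hypothetical validation errors recorded at the end of each epoch.
val_errors = [0.90, 0.62, 0.48, 0.41, 0.43, 0.40, 0.44, 0.47, 0.52, 0.39]
best_epoch, best_error = early_stopping_epoch(val_errors)
print(best_epoch, best_error)   # 5 0.4
```

With a patience of 3, training stops at epoch 8 and the slightly better error at epoch 9 is never observed, which illustrates the trade-off between saving computation and stopping too soon.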


4.5.2 Vanishing/Exploding Gradients 


Fig. 4.19: Early stopping point is shown when validation error begins to diverge
from training error


When training neural networks with many layers with backpropagation, the issue of
vanishing/exploding gradients arises. During backpropagation, we are multiplying
the gradient by the output of each successive layer. This means that the gradient can
get larger and larger if $|\nabla E| > 1$ and smaller and smaller if $|\nabla E| < 1$
as it is multiplied by each successive layer. Practically, this means, in the case of
vanishing gradients, very little of the error is propagated back to the earlier layers
of the network, causing learning to be very slow or nonexistent. For exploding gra-
dients, this causes the weights to eventually overflow, which prevents learning. The
deeper a neural network becomes, the greater a problem this becomes.

In the case of exploding gradients, a simple, practical solution is to “clip” the
gradients, setting a maximum for the gradient values at each backpropagation step
to control the growth of the weights. We revisit this topic when addressing recurrent
neural networks.
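Gradient clipping can be sketched in two common forms. Clipping by value matches the description above of setting a maximum for each gradient value. Clipping by norm, which rescales the whole gradient vector, is a widely used alternative included here for comparison and is not described in the text. The gradient and thresholds are hypothetical.

```python
import math

def clip_by_value(grad, max_value):
    """Clamp every gradient component to [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    """Rescale the whole gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

grad = [30.0, -40.0, 0.5]            # hypothetical exploding gradient
print(clip_by_value(grad, 1.0))      # [1.0, -1.0, 0.5]
print(clip_by_norm(grad, 5.0))
```

Clipping by value can change the direction of the gradient, whereas clipping by norm preserves the direction and only limits the step length.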


4.5.3 Full-Batch and Mini-Batch Gradient Descent


Batch gradient descent is a variant of gradient descent that evaluates the error on
the whole dataset before updating the model by accumulating the error after each
example. This alleviates some of the problems of SGD, such as the noise introduced
from each example, but the infrequent updates can cause a higher variance
between training epochs, which can create significant differences in the models.
This approach is rarely used in practice with deep learning.

A suitable compromise between these two strategies is mini-batch gradient de-
scent. Mini-batch gradient descent splits the dataset into batches, and the model
accumulates the error over a mini-batch before making an update. This approach
provides a variety of advantages, including:


• Reduced noise in each model update due to accumulating the gradients from
  multiple training examples
• Greater efficiency than SGD
• Faster training by taking advantage of matrix operations to reduce IO time


One downside of mini-batch gradient descent is the addition of the mini-batch 
size as a hyperparameter. The mini-batch size, often just called “batch” size for con- 
venience, is usually set based on the model’s hardware limitations to not exceed the 
memory of either the CPU or GPU. Additionally, batch sizes are typically powers of 
2 (8, 16, 32, etc.) due to common hardware implementations. In general, it is desir- 
able to strike a balance with a small batch size yielding a quicker convergence and 
a larger batch size which converges more slowly but with more accurate estimates. 
It is recommended to review the learning curves of a few different batch sizes to 
decide on the best size. 
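The mini-batch procedure can be sketched with a one-parameter regression model. Everything in this example is an assumption chosen for clarity: the data follow $y = 2x$, the model is $\hat{y} = wx$, the batch size is 4, and the dataset is reshuffled each epoch with a fixed seed so that the run is repeatable.

```python
import random

def minibatches(dataset, batch_size, seed=0):
    """Yield shuffled mini-batches; the last batch may be smaller."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

def mean_gradient(batch, w):
    """Average gradient of 0.5 * (w * x - y)^2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

# Hypothetical 1-D regression data generated from y = 2x.
dataset = [(float(x), 2.0 * x) for x in range(1, 11)]
w, lr = 0.0, 0.01
for epoch in range(50):
    for batch in minibatches(dataset, batch_size=4, seed=epoch):
        w -= lr * mean_gradient(batch, w)   # one update per mini-batch
print(w)
```

Ten examples with a batch size of 4 give three updates per epoch (batches of 4, 4, and 2), compared with ten updates for per-example SGD and one for full-batch gradient descent.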


4.5.4 Regularization 


Practically, controlling the generalization error is achieved by creating a large model
that is appropriately regularized [GBC16a, Bis95]. Regularization can take many
forms. Some methods focus on reducing the capacity of the models by penalizing
abnormal parameters in the objective function through an added regularization term

\[ E(W; \hat{y}, y) = E(\hat{y}, y) + \Omega(W) \tag{4.39} \]

where $W$ is the weights of the network. Some approaches focus on limiting the in-
formation provided to the network (e.g., dropout) or normalizing the output of layers
(batch normalization), while others may make changes to the data directly. Here we
will explore a variety of regularization methods, and it is typically suggested to
combine several of them for any given problem.


4.5.4.1 $L_2$ Regularization: Weight Decay


One of the most common regularization methods is the $L_2$ regularization method,
commonly referred to as weight decay. Weight decay adds a regularization term
to the error function that pushes the weights towards the origin, penalizing high
weight variations. Weight decay introduces a scalar $\alpha$ that penalizes weights moving
away from the origin. This functions as a zero-mean Gaussian prior on the training
objective, limiting the freedom of the network to learn large weights that might be
associated with overfitting. The setting of this parameter becomes quite important
because if the model is too constrained, it may be unable to learn.
$L_2$ regularization is defined as:

\[ \Omega(W) = \frac{\alpha}{2} W^{T} W \tag{4.40} \]

The loss function can then be described as:

\[ E(W; \hat{y}, y) = \frac{\alpha}{2} W^{T} W + E(\hat{y}, y). \tag{4.41} \]

With the gradient being:

\[ \nabla_{W} E(W; \hat{y}, y) = \alpha W + \nabla_{W} E(\hat{y}, y) \tag{4.42} \]

And the parameter update becomes:

\[ W = W - \epsilon (\alpha W + \nabla_{W} E(\hat{y}, y)), \tag{4.43} \]

where $\epsilon$ is the learning rate.
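The update in Eq. (4.43) can be sketched element-wise. In this illustration the learning rate of 0.1, the decay coefficient of 0.01, and the weights are assumed values; the data gradient is set to zero so that the effect of the penalty is isolated.

```python
def weight_decay_step(W, grad, lr=0.1, alpha=0.01):
    """Eq. (4.43): W <- W - lr * (alpha * W + grad), element-wise."""
    return [w - lr * (alpha * w + g) for w, g in zip(W, grad)]

# With a zero data gradient, each step shrinks the weights by (1 - lr * alpha).
W = [1.0, -2.0]
W_next = weight_decay_step(W, grad=[0.0, 0.0])
print(W_next)
```

Each weight is multiplied by $(1 - \epsilon\alpha) = 0.999$ per step, which is where the name weight decay comes from.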


4.5.4.2 $L_1$ Regularization


A less common regularization method is $L_1$ regularization. This technique also func-
tions as a weight penalization. The regularizer is a sum of the absolute values of the
weights:

\[ \Omega(W) = \alpha \sum_{i} |w_i| \tag{4.44} \]


As training progresses many of the weights will become zero, introducing spar- 
sity into the model weights. This is often used in feature selection but is not always 
a desirable quality with neural networks. 


4.5.4.3 Dropout 


Perhaps the second-most common regularization method in deep learning is 
Dropout [Sri+14]. Dropout has been a simple and highly effective method to reduce 
overfitting of neural networks. It stems from the idea that neural networks can have 
very fragile connections from the input to the output. These learned connections 
may work for the training data but do not generalize to the test data. Dropout aims 
to correct this tendency by randomly “dropping out” connections in the neural 
network training process so that a prediction cannot depend on any single neuron 
during training, as illustrated in Fig. 4.20. 

Applying dropout to a network involves applying a random binary mask sampled from a
Bernoulli distribution, where each unit is dropped with probability $p$. This mask is
applied element-wise during the feed-forward operation, multiplying the dropped
activations by 0. During the backpropagation step, the gradients of the masked units
are set to 0, and the remaining gradients are scaled up by $\frac{1}{1-p}$.

Fig. 4.20: Dropout when applied to a fully connected neural network. (a) Standard
2-layer (hidden) neural network. (b) Standard 2-layer (hidden) neural network with
dropout
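
The masking step can be sketched as follows. This is a minimal NumPy illustration that assumes the common "inverted dropout" variant, in which the kept activations are also scaled by $\frac{1}{1-p}$ in the forward pass so that the backward scaling follows directly from the chain rule. Here `p` is taken to be the probability of dropping a unit, and the function names are hypothetical.

```python
import numpy as np

def dropout_forward(h, p, rng):
    """Drop each activation with probability p and rescale the survivors."""
    keep = rng.random(h.shape) >= p      # True = keep, False = drop
    scaled_mask = keep / (1.0 - p)       # 0 for dropped units, 1/(1-p) for kept ones
    return h * scaled_mask, scaled_mask

def dropout_backward(grad_out, scaled_mask):
    """Dropped units receive zero gradient; the rest are scaled by 1/(1-p)."""
    return grad_out * scaled_mask

rng = np.random.default_rng(0)
h = np.ones((4, 1000))
out, mask = dropout_forward(h, p=0.5, rng=rng)
grad_in = dropout_backward(np.ones_like(h), mask)
```

Because of the rescaling, the expected value of each activation is unchanged, so no further adjustment is needed at inference time, when the mask is simply omitted.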




4.5.4.4 Multitask Learning 


In all machine learning tasks, we optimize for a specific error metric or function.
To perform well on several tasks simultaneously, we would therefore usually train
a model for each metric and then ensemble, linearly combine, or connect the models in
some other meaningful way. Because deep learning relies on gradient-based
optimization, we can instead optimize several objective functions simultaneously.
This allows the underlying network to learn a general representation that can
accomplish multiple tasks. Multitask learning has recently become a widely used
approach. The addition of auxiliary tasks can help improve the gradient signal to the
learned parameters, leading to better quality on the overall task [Rud17a].


4.5.4.5 Parameter Sharing 


Another form of regularization is parameter sharing. So far we have only considered
fully connected neural networks, which learn an individual weight for every input.
In some tasks the inputs are similar enough that it is undesirable to learn a different
set of parameters for each of them; it is better to share what is learned in multiple
places. This can be accomplished by sharing a set of weights across different inputs.
Parameter sharing is not only useful as a regularizer, but also provides training
benefits such as reduced memory (one copy of a set of weights) and a reduced number
of unique model parameters.

One approach that leverages parameter sharing is a convolutional neural network, 
which we explore in Chap. 6. 


4.5.4.6 Batch Normalization 


During training, there may be a lot of variation in the training examples, which
introduces noise into the training process. One remedy, recommended in the
introduction, is to normalize the data before training. Normalization reduces the
amount that weights need to shift to accommodate a specific example, maintaining the
same distribution properties. With deep learning we have multiple layers of
computation with hidden values that are passed to subsequent layers. The output of
each of these layers is likely to be a non-normalized input to the next layer, and its
distribution is likely to change frequently during the training process. This change
is commonly referred to as “internal covariate shift.” Batch normalization [IS15]
aims to reduce internal covariate shift in a network by normalizing the outputs of
intermediate layers during training. This speeds up the training process and allows
for higher learning rates without risking divergence.

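Batch normalization achieves this by normalizing the output of the previous hidden
layer by the batch’s (mini-batch’s) mean and variance. This normalization, however,
would affect the inference phase, and, thus, batch normalization captures a
moving average of the mean and variance and fixes them at inference time.

For an input mini-batch $B = \{x_1, \ldots, x_m\}$, we learn parameters $\gamma$ and
$\beta$ via:

$$\begin{aligned}
\mu_B &= \frac{1}{m} \sum_{i=1}^{m} x_i \\
\sigma_B^2 &= \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2 \\
\hat{x}_i &= \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
y_i &= \gamma \hat{x}_i + \beta,
\end{aligned} \tag{4.45}$$

where $\epsilon$ here denotes a small constant added for numerical stability.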
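
The training-time and inference-time behavior described above can be sketched in NumPy as follows. This is an illustrative sketch, not a framework implementation: the `running` dictionary, the `momentum` value, and the function names are assumptions chosen for the example.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running, momentum=0.9, eps=1e-5):
    """Normalize with mini-batch statistics and update the moving averages."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * x_hat + beta

def batch_norm_infer(x, gamma, beta, running, eps=1e-5):
    """At inference time the stored moving averages are used instead."""
    x_hat = (x - running["mean"]) / np.sqrt(running["var"] + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 5))   # mini-batch of 64 examples
gamma, beta = np.ones(5), np.zeros(5)
running = {"mean": np.zeros(5), "var": np.ones(5)}
y = batch_norm_train(x, gamma, beta, running)
```

With $\gamma = 1$ and $\beta = 0$, each output feature has approximately zero mean and unit variance over the mini-batch; the learned parameters let the network undo this normalization where that is useful.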


4.5.5 Hyperparameter Selection 


Most learning techniques and regularization methods have some form of training 
configuration parameters associated with them. Learning rate, momentum, dropout 
probability, and weight decay, for example, all need to be selected for each model. 
Selecting the best combination of these hyperparameters can be a challenging task. 


4.5.5.1 Manual Tuning 


Manual hyperparameter tuning is recommended when applying an existing model
to a new dataset or a new model to an existing dataset. Manual selection helps
provide intuition about the network, which is useful for understanding whether a
particular set of parameters will cause the network to overfit or underfit.
It is advised to monitor the norm of the gradients and how quickly a model’s loss
converges or diverges. In general, the learning rate is the most important
hyperparameter, having the most impact on the effective capacity of the network
[GBC16b]. Selecting the right learning rate for a model will allow good convergence,
and early stopping will prevent the model from overfitting to the training set. If the
learning rate is too high, large gradients can cause the network to diverge, in some
cases preventing future learning (even when the learning rate is later lowered). If
the learning rate is too low, small updates will slow the learning process and can
also cause the model to settle into a local minimum with high training and
generalization error.


4.5.5.2 Automated Tuning 


Automatic hyperparameter selection is a much faster and more robust method for
optimizing a training configuration. Grid search, introduced in Chap. 2, is the most
common and straightforward technique. In a grid search, uniformly or logarithmically
spaced values are provided for each parameter to be optimized, and a model is trained
for every combination of parameters. This approach is effective; however, it requires
a significant amount of computation time to train the set of models. Typically, this
cost can be reduced by investigating large ranges first and then narrowing the set of
parameters or ranges, performing another grid search with the new ranges.

Random hyperparameter search is sometimes more robust to the nuances of training,
as some combinations of hyperparameters can have a cumulative effect. Random search
is similar to a grid search, except that it samples values at random from the range
of each parameter rather than using evenly spaced samples. This has been shown to
consistently outperform grid search, because a grid leaves regions of the
hyperparameter space unexplored (given the same number of parameter combinations).
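
The difference between the two sampling strategies can be illustrated with the standard library alone. The sketch below only generates candidate configurations and does not train any model; the parameter names, ranges, and helper functions are hypothetical choices for the example.

```python
import itertools
import random

def grid_candidates(learning_rates, dropouts):
    """Every combination of the listed values."""
    return list(itertools.product(learning_rates, dropouts))

def random_candidates(n, lr_exponent_range, dropout_range, seed=0):
    """n configurations drawn at random from the same ranges."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        lr = 10 ** rng.uniform(*lr_exponent_range)   # log-uniform learning rate
        dropout = rng.uniform(*dropout_range)
        candidates.append((lr, dropout))
    return candidates

grid = grid_candidates([1e-4, 1e-3, 1e-2], [0.1, 0.3, 0.5])
rand = random_candidates(9, lr_exponent_range=(-4, -2), dropout_range=(0.1, 0.5))
distinct_lr_grid = len({lr for lr, _ in grid})   # the grid reuses 3 learning rates
distinct_lr_rand = len({lr for lr, _ in rand})   # random search tries a new one each time
```

For the same budget of nine models, the grid evaluates only three distinct learning rates, whereas random search evaluates nine. If the learning rate is the parameter that matters most, random search therefore explores it more thoroughly.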

Typically, the majority of the models explored with grid search and random
search are trained with poor combinations of hyperparameters. This can be alleviated
to some degree by setting appropriate bounds for the search, gleaned from manual
exploration; ideally, however, the performance of the models trained so far is used
to determine the next set of parameters. Various conditioned and Bayesian
hyperparameter selection procedures have been introduced to accomplish this [SLA12].


4.5.6 Data Availability and Quality 


Regularization is the most common technique to prevent overfitting, but overfitting
can also be reduced by increasing the amount of data. Data is the most important
component of any machine learning model. Although it may seem obvious, this is
often one of the most overlooked components in real-world scenarios. Abstractly,
neural networks learn from the experiences they encounter. In binary classification,
for example, the positive example-label pairs are encouraged, while negative
pairs are discouraged. Tuning the neural network’s hyperparameters is typically the
most appropriate first step to improve generalization error. If a performance gap
still exists between the training and generalization error, it may be necessary to
increase the amount of data (or, in some cases, its quality).

Neural networks can be robust to some amount of noise in a dataset, and during 
the training process, the effects of outliers are typically lessened. However, erro- 
neous data can cause many issues. Poor model performance in real-world applica- 
tions can be caused by consistently incorrect labels or insufficient data. 


In real-world applications, if there seems to be odd behavior throughout the 
training process, it may be a sign of data inconsistencies. 


This typically manifests itself in one of two ways: overfitting or poor convergence.
In the case of overfitting, the model may learn an anomaly of the data (such
as the presence of a user name in many negative sentiment reviews).

Deep learning in particular benefits more from larger datasets than other machine
learning algorithms. Many of the quality improvements achieved by deep learning
are directly attributable to the increase in the size of the datasets used. Large
datasets can act as a regularization technique to prevent a model from overfitting to
specific examples.


4.5.6.1 Data Augmentation 


One of the easiest ways to improve model performance is to introduce more train- 
ing data. Practically, this can be expensive, but if the data can be augmented in a 
meaningful way, this method can be quite useful. This technique can be particularly 
beneficial for reducing overfitting to specific anomalies in a dataset.

In the case of images, we can imagine rotating and horizontal flipping as creat- 
ing a different (X,y) pair, without having to re-label any data. This, however, would 
not be the case for handwritten numbers, where a horizontal flip might corrupt the 
interpretation of the label (think of the 5 and the 2). When incorporating data aug- 
mentation, make sure to keep the constraints of the example and target relationship 
in mind. 
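
A label-aware augmentation step can be sketched as follows. The example assumes, purely for illustration, that the classes 0 and 8 survive a horizontal flip while other classes do not; the function name and the toy arrays are hypothetical.

```python
import numpy as np

def augment_with_flips(images, labels, flip_safe_labels):
    """Append horizontally flipped copies of images whose label survives a flip."""
    keep = np.isin(labels, list(flip_safe_labels))
    flipped = images[keep][:, :, ::-1]          # reverse the width axis
    return (np.concatenate([images, flipped]),
            np.concatenate([labels, labels[keep]]))

images = np.arange(3 * 2 * 4).reshape(3, 2, 4)  # 3 toy "images" of size 2x4
labels = np.array([0, 5, 8])
aug_x, aug_y = augment_with_flips(images, labels, flip_safe_labels={0, 8})
```

The image labeled 5 is left out of the flipped set, which mirrors the constraint discussed above: an augmentation is only valid when the transformed example still matches its label.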


4.5.6.2 Bagging 


Bagging is another technique commonly used in machine learning. This technique 
is based on the idea that we can reduce the ability for models to overfit by training 
multiple models on different portions of the training set. The bagging technique 
samples from the original dataset (with replacement), creating subtraining sets on 
which models are trained. The models should learn different features since they are 
learning different portions of the data, leading to a lower generalization error after 
combining the results from each model. This strategy tends to be used less often 
in practice due to the computation time of deep learning models, the large data 
requirements of deep models, and the introduction of other regularization methods 
(like Dropout). 
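
The sampling and combination steps of bagging can be sketched without training any models. In the sketch below the per-model class probabilities are made-up numbers, and the function names are assumptions for the example.

```python
import numpy as np

def bootstrap_indices(n_examples, n_models, rng):
    """One row of indices per model, each sampled with replacement."""
    return rng.integers(0, n_examples, size=(n_models, n_examples))

def bagged_prediction(per_model_probs):
    """Average class probabilities over models, then pick the most likely class."""
    return np.mean(per_model_probs, axis=0).argmax(axis=-1)

rng = np.random.default_rng(0)
idx = bootstrap_indices(n_examples=1000, n_models=3, rng=rng)
unique_fraction = len(np.unique(idx[0])) / 1000   # roughly 1 - 1/e of the data

# Toy outputs of 3 models for a single example with 2 classes.
probs = np.array([[[0.6, 0.4]], [[0.4, 0.6]], [[0.3, 0.7]]])
pred = bagged_prediction(probs)
```

Each bootstrap sample contains only about 63% of the distinct training examples, so each model sees a noticeably different subset, which is also why bagging competes with the large data requirements of deep models.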


4.5.6.3 Adversarial Training 


Adversarial examples are examples designed to cause a classifier to misclassify the 
example. The free parameter space of neural networks means that we can find spe- 
cific input examples that can take advantage of the specific set of trained parameters 
within a model [GSS14].

Because of these properties, the techniques used to create adversarial examples can
also be used to produce training data for the network. Training on such examples
reduces the likelihood of success of a particular attack and improves the robustness
of the network by providing training examples that focus on the areas of
uncertainty in the parameter space.
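
One well-known way of constructing an adversarial example is the fast gradient sign method, which perturbs the input in the direction that increases the loss. The sketch below applies it to a logistic regression classifier, chosen here only because its input gradient has a simple closed form; the parameters and input values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fgsm_example(x, y, w, b, eps):
    """Move x by eps in the sign direction of the loss gradient."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w        # d(cross-entropy)/dx for logistic regression
    return x + eps * np.sign(grad_x)

w, b = np.array([2.0, -1.0]), 0.0     # hypothetical trained parameters
x, y = np.array([0.3, 0.2]), 1.0      # classified correctly: w @ x + b > 0
x_adv = fgsm_example(x, y, w, b, eps=0.25)
p_clean = sigmoid(w @ x + b)
p_adv = sigmoid(w @ x_adv + b)
```

In adversarial training, pairs such as `(x_adv, y)` are added to the training data with the original label, so the model learns to classify the perturbed input correctly.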


4.5.7 Discussion 


Broadly speaking, there are typically four pillars in tension when configuring 
and training neural networks: 


• Data availability (and quality)
• Computation speed
• Memory requirements
• Quality


In practice, it is generally a good idea to establish the end goal and work backwards 
to figure out the boundaries for each of the constraints. 

Generally speaking, the initial stage of model selection ensures the model has the 
capacity to learn reliably. This inevitably leads to overfitting on the training dataset, 
at which point regularization is introduced to decrease the gap between the training 
loss and validation loss. In practice, it is usually not necessary (nor feasible) to start 
from scratch for each new model type or task. However, we believe introducing 
complexity gradually is best with highly dynamic systems. It is common to start
with empirically verified architecture sizes and apply regularization directly from
the beginning; however, it is best to remove complexity when unexpected situations
arise.


4.5.7.1 Computation and Memory Constraints 


While numerous advancements made deep learning possible, one of the most
significant contributors to the recent growth in adoption is undoubtedly hardware
improvements, particularly specialized computer architectures (GPUs). The
processing speeds accomplished with GPUs have been among the most significant
contributing factors to the popularity and practicality of deep learning. Speed
advantages through matrix optimizations and the ability to batch computation make
the problems of deep learning ideal for GPU architectures. This development made it
possible to move beyond shallow architectures to the deep, complex architectures
that we see today.

Large datasets and deep learning architectures have led to significant quality
improvements; however, the computational cost of deep learning models is typically
higher than that of other machine learning methods, which needs to be considered in
resource-limited environments (such as mobile devices). The model requirements also
impact the amount of hyperparameter optimization that can be done. It is unlikely
that a full grid search can be performed for models that take days or weeks to train.

The same reasoning applies to memory concerns, with larger models requiring more
space. However, many techniques are being introduced to shrink model sizes, such as
quantizing parameters or hashing parameter values [Jou+16b].


4.6 Unsupervised Deep Learning 


So far, we have examined examples of feed-forward neural networks for supervised 
learning. We will now look at some other architectures that extend neural networks 
and deep learning to unsupervised tasks by looking at three common unsupervised 
architectures: restricted Boltzmann machines (RBMs), deep belief networks, and
autoencoders. We will build on our current knowledge by analyzing some simple 
architectures that accomplish tasks other than classification. 

As discussed in Chap. 2, unsupervised models learn representations and features
from data without labels. This is usually a very desirable property because
unlabeled data is readily available in large volumes.


4.6.1 Energy-Based Models 


Energy-based models (EBMs) gain their inspiration from physics. The free energy 
in a system can be correlated with the probability of an observation. High energies 
are associated with a low probability observation and low energies are associated 
with a high probability observation. Thus, in EBMs, the aim is to learn an energy 
function that results in low energies for observed examples from the dataset and 
higher energies for unobserved examples [LeC+06]. 

For an energy-based model, a probability distribution is defined through the
energy function, similar to:

$$p(x) = \frac{e^{-E(x)}}{Z} \tag{4.46}$$

where $Z$ is the normalization constant, commonly referred to as the partition
function:

$$Z = \sum_{x} e^{-E(x)} \tag{4.47}$$

The partition function is intractable for many models, as it requires a sum over all
possible configurations of the input $x$, and the number of configurations grows
exponentially with the dimension of $x$. However, it can be approximated, as we will
see in the case of RBMs.
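
For a very small binary input, the partition function can be computed exactly by enumeration, which makes the exponential cost visible. The energy function below is a toy linear energy assumed only for this illustration.

```python
import itertools
import math

def energy(x, w):
    """Toy energy E(x) = -sum_i w_i * x_i (hypothetical, for illustration)."""
    return -sum(wi * xi for wi, xi in zip(w, x))

def partition_function(w):
    """Sum of exp(-E(x)) over all 2**n binary configurations."""
    states = itertools.product([0, 1], repeat=len(w))
    return sum(math.exp(-energy(x, w)) for x in states)

def probability(x, w):
    return math.exp(-energy(x, w)) / partition_function(w)

w = [0.5, -1.0, 2.0]
Z = partition_function(w)
total = sum(probability(x, w) for x in itertools.product([0, 1], repeat=3))
```

Three binary inputs require 8 terms; a 28x28 binary image would require $2^{784}$ terms, which is why the sum has to be approximated in practice.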

Learning useful features requires accounting for our input $x$ and also a
hidden portion $h$. Thus, the probability of an observation $x$ can be written as:

$$P(x) = \sum_{h} P(x, h) = \sum_{h} \frac{e^{-E(x, h)}}{Z} \tag{4.48}$$

Free energy is defined as:

$$F(x) = -\log \sum_{h} e^{-E(x, h)}, \tag{4.49}$$

and the negative log-likelihood gradient is:

$$-\frac{\partial \log p(x)}{\partial \theta} = \frac{\partial F(x)}{\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta}. \tag{4.50}$$

The gradient has two parts, commonly referred to as the positive phase and the
negative phase. The positive phase increases the probability of the training data,
while the negative phase decreases the probability of samples generated by the model.


4.6.2 Restricted Boltzmann Machines 


The restricted Boltzmann machine (RBM) [HS06] uses a log-linear Markov random
field (MRF) to model the energy function for unsupervised learning. The RBM is, as
the name suggests, a restricted form of the Boltzmann machine [HS83], which provides
some useful constraints on the architecture to improve the tractability and
convergence of the algorithm. The RBM limits the connectivity of the network, as
shown in Fig. 4.21, allowing only visible-hidden connections. This modification
allows for more efficient training algorithms, such as gradient-based
Contrastive Divergence.

Fig. 4.21: Illustration of an RBM. Note that this can be seen as a fully connected
layer as shown previously with only visible-hidden connections in the network.
Connections are only shown for one visible neuron and one hidden neuron for the sake
of clarity


The energy function of the RBM is defined as:

$$E(x, h) = -h^\top W x - c^\top x - b^\top h \tag{4.51}$$

where $W$ represents the weight matrix connecting the visible units and the hidden
units, $b$ is the bias of the hidden units, and $c$ is the bias of the visible units
(one entry for each $x_j$). We then get the probability from the energy function:

$$p(x, h) = \frac{e^{-E(x, h)}}{Z} \tag{4.52}$$

Furthermore, if $x_j, h_i \in \{0, 1\}$, we can further reduce the equation to:

$$\begin{aligned}
p(h_i = 1 \mid x) &= \sigma(b_i + W_i x) \\
p(x_j = 1 \mid h) &= \sigma(c_j + W_j^\top h)
\end{aligned} \tag{4.53}$$

where $\sigma$ is the sigmoid function, $W_i$ is row $i$ of $W$, and $W_j$ is column
$j$ of $W$. The free energy formula therefore becomes:

$$F(x) = -c^\top x - \sum_{i} \log\left(1 + e^{b_i + W_i x}\right) \tag{4.54}$$
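
The closed-form free energy can be checked against the definition of free energy as a sum over all hidden configurations. The NumPy sketch below does this for a tiny RBM with randomly chosen parameters; the sizes and function names are assumptions for the example, with `b` as the hidden bias and `c` as the visible bias as in the energy function above.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def energy(x, h, W, b, c):
    return -(h @ W @ x) - c @ x - b @ h

def p_h_given_x(x, W, b):
    return sigmoid(b + W @ x)              # p(h_i = 1 | x) for every i

def p_x_given_h(h, W, c):
    return sigmoid(c + W.T @ h)            # p(x_j = 1 | h) for every j

def free_energy(x, W, b, c):
    return -(c @ x) - np.sum(np.log1p(np.exp(b + W @ x)))

def free_energy_brute_force(x, W, b, c):
    """-log of the sum of exp(-E(x, h)) over all binary hidden vectors h."""
    hs = itertools.product([0.0, 1.0], repeat=len(b))
    return -np.log(sum(np.exp(-energy(x, np.array(h), W, b, c)) for h in hs))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                # 3 hidden units, 4 visible units
b, c = rng.normal(size=3), rng.normal(size=4)
x = np.array([1.0, 0.0, 1.0, 1.0])
```

The two free energy functions agree because the sum over hidden configurations factorizes across hidden units, which is a direct consequence of allowing only visible-hidden connections.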


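We can then compute the gradients for the RBM as:

$$\begin{aligned}
-\frac{\partial \log p(x)}{\partial W_{ij}} &= \mathbb{E}_{x}\left[p(h_i \mid x)\, x_j\right] - x_j\, \sigma(b_i + W_i x) \\
-\frac{\partial \log p(x)}{\partial b_i} &= \mathbb{E}_{x}\left[p(h_i \mid x)\right] - \sigma(b_i + W_i x) \\
-\frac{\partial \log p(x)}{\partial c_j} &= \mathbb{E}_{x}\left[p(x_j \mid h)\right] - x_j
\end{aligned} \tag{4.55}$$

where the expectations are taken under the model distribution. The samples of $p(x)$
needed to estimate these expectations can be obtained by running a Markov chain with
Gibbs sampling.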
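
A single Gibbs step alternates between sampling the hidden units given the visible units and the visible units given the hidden units. The sketch below uses one such step in place of the model expectation, which is the idea behind contrastive divergence with a single step (often called CD-1). It is an illustrative approximation under assumed sizes and names, with `b` as the hidden bias and `c` as the visible bias.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x, W, b, c, rng):
    """Sample h ~ p(h | x), then a new visible vector x' ~ p(x | h)."""
    h = (rng.random(b.shape) < sigmoid(b + W @ x)).astype(float)
    x_new = (rng.random(c.shape) < sigmoid(c + W.T @ h)).astype(float)
    return x_new, h

def cd1_gradients(x, W, b, c, rng):
    """Estimate the negative log-likelihood gradients with one Gibbs step."""
    x_model, _ = gibbs_step(x, W, b, c, rng)
    ph_data = sigmoid(b + W @ x)
    ph_model = sigmoid(b + W @ x_model)
    dW = np.outer(ph_model, x_model) - np.outer(ph_data, x)
    db = ph_model - ph_data
    dc = x_model - x
    return dW, db, dc

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(3, 4))       # 3 hidden units, 4 visible units
b, c = np.zeros(3), np.zeros(4)
x = np.array([1.0, 0.0, 1.0, 1.0])
dW, db, dc = cd1_gradients(x, W, b, c, rng)
W, b, c = W - 0.1 * dW, b - 0.1 * db, c - 0.1 * dc   # one gradient-descent step
```

Running the chain for more steps before computing the model term gives a better estimate of the expectation at a higher computational cost.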


4.6.3 Deep Belief Networks 


The effectiveness of RBMs led to the finding that these architectures can be stacked
and trained to create a deep belief network (DBN) [HOT06b]. Each sub-network is
trained in isolation, with the hidden layer of one network serving as the visible
layer of the next. This layer-by-layer training led to one of the first
effective approaches to deep learning. A deep belief network is shown in Fig. 4.22.

Fig. 4.22: Illustration of a three-layer deep belief network. Each RBM layer is
trained individually, starting with the lowest layer


4.6.4 Autoencoders 


The autoencoder is an unsupervised deep learning approach to perform
dimensionality reduction on a set of data. The aim is to learn a lower-dimensional
representation of the input data by training an encoder to reduce the dimensionality
of the data and a decoder to reproduce the input. The autoencoder is thus a neural
network that is trained to reproduce its input rather than predict a class. The
learned representation retains as much of the information in the input as possible in
a smaller, compressed vector, learning what is most important for the reconstruction
in order to minimize a reconstruction error.

The autoencoder is split into two components, the encoder and the decoder. The
encoder converts the input, $x$, into an embedding,* $z$. The decoder maps the
encoding, $z$, back to the original input $x$. Thus, for a neural network encoder,
Enc(x), and decoder, Dec(z), the loss $\mathcal{L}$ (mean squared error) is minimized
by:

$$\begin{aligned}
z &= \mathrm{Enc}(x) \\
\hat{x} &= \mathrm{Dec}(z) \\
\mathcal{L}(x, \hat{x}) &= \lVert x - \hat{x} \rVert^2
\end{aligned} \tag{4.56}$$


An illustration of the autoencoder architecture is shown in Fig. 4.23.

Fig. 4.23: Architecture diagram of an autoencoder with six input values and an
embedding of size 4


Training an autoencoder is very similar to other neural network architectures for 
classification, except for the loss function. Whereas the softmax was previously used 
to predict a distribution over a set of classes, we now want to produce a real-valued 
output that can be compared with the input. This is exactly what the MSE objective
function used previously accomplishes, and it is the primary choice for
autoencoders.†

The training of this network is the same as defined in Algorithm 1, with one
occasional difference. Many times in autoencoders, it is beneficial to tie the weights
together between the encoder and decoder, with the decoder weights $W^{*} = W^\top$.
In this scenario, the gradients for the weights $W$ will be the sum of two gradients,
one from the encoder and one from the decoder.

* This output of the encoder is sometimes referred to as the code, encoding, or
embedding.

† If the task has real-valued inputs between 0 and 1, then Bernoulli cross-entropy is
a better choice for the objective function.
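
The statement about tied weights can be verified numerically. The sketch below assumes a linear autoencoder without biases or non-linearities, which is a simplification made only to keep the gradients short; it computes the encoder and decoder contributions separately and compares their sum with a finite-difference estimate.

```python
import numpy as np

def loss(W, x):
    z = W @ x                      # encoder
    x_hat = W.T @ z                # decoder with tied weights
    return np.sum((x - x_hat) ** 2)

def tied_gradient(W, x):
    z = W @ x
    r = 2.0 * (W.T @ z - x)                # dL/dx_hat
    grad_decoder = np.outer(z, r)          # contribution through x_hat = W^T z
    grad_encoder = np.outer(W @ r, x)      # contribution through z = W x
    return grad_encoder + grad_decoder

def numerical_gradient(W, x, h=1e-6):
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        grad[idx] = (loss(W + step, x) - loss(W - step, x)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))        # 4 inputs compressed to a 2-dimensional code
x = rng.normal(size=4)
analytic = tied_gradient(W, x)
numeric = numerical_gradient(W, x)
```

Frameworks with automatic differentiation perform this accumulation on their own when the same parameter is used in two places, so the sum rarely needs to be written by hand.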

In general, there are four types of autoencoders: 


• Undercomplete autoencoders (standard)
• Sparse autoencoders
• Denoising autoencoders
• Variational autoencoders (VAE)


with variants of each depending on the application. 


4.6.4.1 Undercomplete Autoencoders 


An undercomplete autoencoder is the most common type. As shown in Fig. 4.23, the 
encoder narrows the network to produce an encoding that is smaller than the input. 
This operates as a learned dimensionality reduction technique. Ideally, the encoder 
learns to compress the most essential information into the encoding, so the decoder 
can reconstruct the input. 


4.6.4.2 Denoising Autoencoders 


A denoising autoencoder takes a noisy input and attempts to decode to a noiseless 
output. The learned representation will be less sensitive to noise perturbations in the 
input. 

For a noise function‡ N(x), the autoencoder can be described as:

$$\begin{aligned}
x' &= N(x) \\
z &= \mathrm{Enc}(x') \\
\hat{x}' &= \mathrm{Dec}(z) \\
\mathcal{L}(x, \hat{x}') &= \lVert x - \hat{x}' \rVert^2
\end{aligned} \tag{4.57}$$

‡ Note: there are no learned parameters in the noise function presented here.


4.6.4.3 Sparse Autoencoders


Sparse autoencoders rely on a minimum threshold of the activations to enforce
sparsity in the encoding, rather than relying on a bottleneck of the encoder. In this
scenario, the encoder can have larger hidden layers than the input, and sparsity can
be achieved by setting a minimum threshold for a neuron, zeroing the outputs for
neurons below the threshold.

One way to train a sparse autoencoder is to add a term to the loss, such as an L1
penalty, to penalize the output activations in the encoder. For a single-layer
encoder, the loss function can be described as
$\mathcal{L}(x, \hat{x}) = \lVert x - \hat{x} \rVert^2 + \lambda \sum_i |z_i|$, where
$\lambda$ sets the weight of the sparsity.


4.6.4.4 Variational Autoencoders 


Variational autoencoders describe the latent space in terms of probability
distributions. The encoding that has been learned by autoencoders so far describes a
sample drawn from some latent space, determined by the encoder. Instead of
representing each value of the encoding by a single value, as the other autoencoders
do, the variational autoencoder learns to represent the encoding as latent
distributions. The distributions are typically taken to be Gaussian, so that two
parameters must be learned: the mean, $\mu$, and the standard deviation, $\sigma$.
The decoder is trained on samples, referred to as “sampled latent vectors,”
drawn from a random distribution parameterized by the learned $\mu$ and $\sigma$
values. A diagram of a VAE is shown in Fig. 4.24.

Fig. 4.24: The variational autoencoder learns a vector of means, $\mu$, and a vector
of standard deviations, $\sigma$. The sampled latent vector, $z$, is computed by
$z = \mu + \sigma \odot \epsilon$, where $\epsilon$ is sampled from a normal
distribution, $\mathcal{N}(0, 1)$


A problem arises when trying to backpropagate through the stochastic operation
of sampling from the Gaussian distribution. The sampling is in the path of the
forward propagation, and its gradient must be computed to obtain gradients for the
encoder; however, the stochastic operation does not have a well-defined gradient.
The reparameterization trick [JGP16] offers a way to rewrite the sampling procedure
to make the stochastic element independent of the learned $\mu$ and $\sigma$
parameters. The sampling of the latent variable $z$ is changed from:

$$z \sim \mathcal{N}(\mu, \sigma^2) \tag{4.58}$$

to the reparameterized:

$$z = \mu + \sigma \odot \epsilon, \tag{4.59}$$

where $\epsilon$ is sampled from a normal distribution, $\mathcal{N}(0, 1)$. Now,
although $\epsilon$ is still stochastic, it does not depend on $\mu$ and $\sigma$, so
gradients can flow to $\mu$ and $\sigma$ through deterministic operations during
backpropagation.

Training the VAE requires optimizing two components in the loss function. The
first component is the reconstruction error that we have optimized for normal
autoencoders, and the second is the KL-divergence. The KL-divergence loss
ensures the learned mean and variance parameters stay close to $\mathcal{N}(0, 1)$.

The overall loss is defined as:

$$\mathcal{L}(x, \hat{x}) + D_{KL}\left(q_j(z \mid x) \,\|\, p(z)\right), \tag{4.60}$$

where $D_{KL}$ is the KL-divergence, $p(z)$ is the prior distribution, and
$q_j(z \mid x)$ is the learned distribution.
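
The reparameterized sampling and the KL term can be sketched in NumPy. The sketch assumes a diagonal Gaussian for the learned distribution and a standard normal prior, for which the KL-divergence has a well-known closed form; the function names and the squared-error reconstruction term are illustrative choices.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """z = mu + sigma * eps, with all randomness confined to eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)

def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction error plus the KL term."""
    return np.sum((x - x_hat) ** 2) + kl_to_standard_normal(mu, sigma)

rng = np.random.default_rng(0)
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 2.0])
z = np.stack([sample_latent(mu, sigma, rng) for _ in range(20000)])
kl = kl_to_standard_normal(mu, sigma)
```

Because `z` is a deterministic function of `mu`, `sigma`, and the externally sampled `eps`, its derivatives with respect to `mu` and `sigma` are simply 1 and `eps`, which is what makes backpropagation through the sampling step possible.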


4.6.5 Sparse Coding 


Sparse coding [Mai+10] aims to learn a set of basis vectors to represent the data. 
These basis vectors can then be used to form linear combinations to represent the 
input x. The technique of learning basis vectors to represent our data is similar to 
techniques like PCA that we explored in Chap. 2. However, with sparse coding, we 
instead learn an over-complete set that will allow the learning of a variety of patterns 
and structures within the data. 

Sparse coding itself is not a neural network algorithm, but we can add a penalty to
an autoencoder to enforce sparsity, which creates a sparse autoencoder.
This is merely the addition of an L1 penalty to the loss function that forces most of
the activations to be 0.


4.6.6 Generative Adversarial Networks 


Generative adversarial networks (GANs) [Goo+14a] are an unsupervised technique
that structures the learning procedure like a zero-sum game. The technique uses
two neural networks referred to as the generator and the discriminator. The
generator provides a generated example to the discriminator network, often produced
from a sample drawn from a latent space or distribution. The discriminator must
discern whether the provided example is a generated (fake) example or an actual
example from the dataset (the real data distribution). An illustration of a GAN is
shown in Fig. 4.25.

Fig. 4.25: Illustration of a generative adversarial network


At training time both true and generated examples are provided to the discrimi- 
nator. The discriminator and generator are trained jointly, with the generator’s ob- 
jective to increase the error of the discriminator, and the discriminator’s objective 
to decrease its error. This is related to the minimax decision rule used in statistics 
and decision theory in zero-sum games. This technique has been used both as a
regularization technique and as a way to generate synthetic data.

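For a generator $G$ and a discriminator $D$, the objective function is given by:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim \mathbb{P}_r}\left[\log D(x)\right] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[\log\left(1 - D(\tilde{x})\right)\right], \tag{4.61}$$

where $\mathbb{P}_r$ and $\mathbb{P}_g$ represent the real data distribution and
generated data distribution, respectively, and $\tilde{x} = G(z)$, where $z$ is drawn
from a noise distribution such as the Gaussian distribution.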
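
The objective can be estimated from discriminator outputs on a batch of real and generated examples. The sketch below uses made-up discriminator outputs instead of actual networks; `d_real` stands for the values of $D$ on real examples and `d_fake` for its values on generated examples.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the objective in Eq. (4.61)."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Discriminator separates real from generated examples well.
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# Discriminator cannot tell the two apart and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator is trained to push this value up, while the generator is trained to push it down. When the discriminator outputs 0.5 for every example, the value equals $-\log 4$.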

GANs tend to be used more commonly in computer vision than in NLP. For
example, some amount of Gaussian noise can be added to an image, while still 
maintaining the overall structure and meaning of the image’s content. Sentences 
are typically mapped to a discrete space instead of a continuous space, as a word is 
discrete (present or not), where noise cannot be readily applied without changing the 
meaning. However, a form of character-level language modeling was accomplished 
in [Gul+17] by using a latent vector to generate 32 one-hot character vectors through 
a convolutional neural network. 


4.7 Framework Considerations 


The majority of the architectural and algorithmic considerations that have been
discussed are already implemented in deep learning frameworks, with CPU and GPU
support. Many of the differences between frameworks center on the implementation
language, target users, and abstractions. The most common implementation language is
C++ with a Python interface. The target users vary broadly, and the choice of
abstractions varies with them. A key abstraction is how deep networks are composed.
Early abstractions focused on layers as blocks of computation that could be linked
together, while more recent frameworks rely on a computational graph approach.


4.7.1 Layer Abstraction 


Earlier, we briefly introduced the concept of the layer abstraction, referring to the 
linear transformation operation as a “linear layer.” Conceptually, we can continue
the layer abstraction to include all portions of the neural network, representing the 
MLP in Fig. 4.8 as three layers with one hidden layer as shown in Fig. 4.26. 


Fig. 4.26: Layer representation of an MLP (input layer, hidden layer, output layer)


Note that although we have represented the inputs, non-linearities, and output as
layers, this is still a single-hidden-layer network.

This makes it easier to split a neural network into logical blocks that can be 
composed together. Early deep learning frameworks took this approach for com- 
posing neural networks. One could create any layer by implementing a minimal set 
of functions, namely the forward propagation step and backward propagation step. 
The layers are connected to form a neural network. 
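
A layer abstraction of this kind can be sketched as a set of classes that each implement a forward and a backward method. The class names and the container below are hypothetical and do not reproduce the interface of any particular framework.

```python
import numpy as np

class Linear:
    def __init__(self, n_in, n_out, rng):
        self.W = 0.1 * rng.normal(size=(n_out, n_in))
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                               # cache the input for backward
        return self.W @ x + self.b

    def backward(self, grad_out):
        self.grad_W = np.outer(grad_out, self.x)
        self.grad_b = grad_out
        return self.W.T @ grad_out               # gradient w.r.t. the input

class Sigmoid:
    def forward(self, x):
        self.y = 1.0 / (1.0 + np.exp(-x))
        return self.y

    def backward(self, grad_out):
        return grad_out * self.y * (1.0 - self.y)

class Sequential:
    def __init__(self, *layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, grad_out):
        for layer in reversed(self.layers):
            grad_out = layer.backward(grad_out)
        return grad_out

rng = np.random.default_rng(0)
net = Sequential(Linear(4, 3, rng), Sigmoid(), Linear(3, 2, rng))
x = rng.normal(size=4)
y = net.forward(x)
grad_x = net.backward(np.ones_like(y))           # gradient of sum(y) w.r.t. x
```

Each layer only needs to know how to transform its input and how to pass a gradient back, so new layer types can be added without touching the container that connects them.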

This abstraction is useful when constructing standard neural networks with defined
behavior, and it has been a common approach for frameworks. It is reasonably
straightforward to reason about the interaction of the layers and make guarantees
about the computational requirements. A downside, as we will see, is that the layer
abstraction becomes difficult to work with when dealing with complex network
structures. For example, if we want recurrent connections in a network, we usually
have to implement all of the recurrent computation in a single layer block (we will
explore this more in Chap. 7).


4.7.2 Computational Graphs


Many frameworks have since moved beyond the layer abstraction to computational
graphs. The computational graph approach is similar in concept to abstract syntax
trees (ASTs) in compilers. A dependency graph of inputs and outputs can be
represented with symbols in a tree. This allows a compiler to generate assembly
instructions, linking libraries and functions to produce an executable. Data flows
through the AST based on the dependencies present in the graph.

In deep learning, a computational graph is a directed graph that defines the
order of computations. The nodes of the graph correspond to operations or
variables, and the edges into a node are its inputs, that is, the values it
depends on. The backpropagation process can then readily be determined by
following the operations in the reverse of the order in which they were
computed in the forward propagation step. An example of a neural network
computational graph is shown in Fig. 4.27.

Fig. 4.27: Illustration of a 3-layer neural network with the corresponding
backward computation graph. Notice how certain operations can still be combined
programmatically for optimization (e.g., Addmm combines the addition and the
multiplication into a single operation). (a) Illustration of a 3-layer neural
network with sigmoid activation functions and a softmax output for 10 classes.
(b) Computational graph built from the network shown in (a)


4.7.3 Reverse-Mode Automatic Differentiation 


Not only is the computational graph approach convenient for complex functions,
it also simplifies gradient computation in a complex neural network. Gradient
computation is central to neural networks, and deriving and coding the gradient
for a specific layer or operation by hand is one of the most difficult parts of
programming deep neural networks. The graph-based approach to deep learning
instead allows gradients to be computed efficiently and automatically in the
reverse mode over the computational graph.

Computational graphs make it much easier to leverage reverse-mode automatic
differentiation methods. Automatic differentiation (AD) [GW08] is a method for
computing the numerical value of the derivative of a function. AD leverages the
fact that, in computers, all mathematical computation is executed as a sequence
of basic mathematical operations (addition, subtraction, multiplication, exp,
log, sin, cos, etc.). The AD approach uses the chain rule of differentiation to
decompose the derivative of a function into the derivatives of each basic
operation in the function. This allows derivatives to be computed automatically
and accurately (within a small precision of the theoretical derivative). The
approach is typically straightforward to implement and leads to much simpler
implementations for complex architectures.

Reverse-mode AD [Spe80] is the preferred approach to AD for deep learning,
because it differentiates a single scalar loss with respect to many parameters
in one backward pass. The forward propagation operation can be seen through the
computational graph. This graph can be finely decomposed into primitive
operations, and, during the backward pass, the gradient of the scalar error can
be computed with respect to the output of each operation.
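
To make the mechanism concrete, here is a minimal sketch of reverse-mode AD over scalar values. The `Var` class, the `sigmoid` helper, and the `backward` function are illustrative names, not the API of any framework; each operation records its inputs together with the local derivative, and the backward pass applies the chain rule in reverse topological order.

```python
import math


class Var:
    """A scalar node in the graph; parents holds (input node, local derivative)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Var(s, ((x, s * (1.0 - s)),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node in the graph."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # inputs are appended before the nodes that use them

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # outputs first, inputs last
        for parent, local_derivative in node.parents:
            parent.grad += local_derivative * node.grad


w, x, b = Var(2.0), Var(3.0), Var(-5.0)
y = sigmoid(w * x + b)  # sigmoid(2 * 3 - 5) = sigmoid(1)
backward(y)
print(w.grad, x.grad, b.grad)
```

A single call to `backward` fills in the gradient for all three inputs, which is the property that makes the reverse mode attractive when one scalar loss depends on many parameters.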


4.7.4 Static Computational Graphs 


Static computational graphs are graphs that have been created with a static view
of memory. The static structure allows the graph to be optimized before it is
computed, enabling parallel computation and optimal sequencing of operations.
For example, fusing certain operations may reduce the time needed for memory
IO, and distributing the computation efficiently across a collection of GPUs
may improve the overall performance. This upfront optimization cost is
worthwhile when there are resource constraints, such as in embedded
applications, or when the network architecture is relatively rigid, as it
repeatedly executes the same graph with little variability in the input.

One of the disadvantages of static computational graphs is that once they are 
created, they cannot be modified. Any modifications would eliminate potential ad- 
vantages in the applied optimization strategy. 


4.7.5 Dynamic Computational Graphs 


Dynamic computational graphs take a different approach, where the operations
are computed dynamically at run-time. This is useful in situations where we do
not know what the computation will be beforehand or where we would like to
execute different computations on different data points. A clear example is the
recurrent computation in recurrent neural networks, whose time sequence inputs
are often of variable length. Dynamic computation is often desirable in NLP
applications, where sentence lengths differ, and similarly in ASR, where audio
files vary in length.
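
The idea can be illustrated without any framework. The sketch below assumes a hypothetical scalar recurrent cell with made-up weights (`w_in`, `w_rec`); the point is only that the list of executed operations is built while the data is processed, so its length follows the length of each input sequence.

```python
import math


def run_recurrent(sequence, w_in=0.5, w_rec=0.9):
    """Unroll a scalar recurrent cell over one sequence, recording each operation."""
    h = 0.0
    trace = []  # the operations actually executed for this particular input
    for t, x in enumerate(sequence):
        h = math.tanh(w_in * x + w_rec * h)
        trace.append(("tanh_cell", t))
    return h, trace


short_seq = [0.1, 0.4]
long_seq = [0.3, -0.2, 0.8, 0.5, -0.1]

h_short, trace_short = run_recurrent(short_seq)
h_long, trace_long = run_recurrent(long_seq)
print(len(trace_short), len(trace_long))  # one recorded step per input element
```

In a static graph the number of unrolled steps would have to be fixed ahead of time, typically by padding every sequence to a common length.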

Each of these approaches has trade-offs, much like comparing dynamically typed
programming languages with statically typed languages. Two current examples of
these approaches are TensorFlow [Aba+15] and PyTorch [Pas+17]: TensorFlow
relies on static computational graphs, while PyTorch utilizes dynamic
computational graphs.


4.8 Case Study 


In this section, we will apply the concepts of this chapter to the common Free
Spoken Digit Dataset⁵ (FSDD). FSDD is a collection of 1500 recordings of spoken
digits, 0-9, from 3 speakers. We increase the number of files by performing data
augmentation, which we discuss in Sect. 4.8.2.

The spoken words are relatively short (most are less than 1.5 s). In its raw
form, audio is a single series of samples in the time domain; however, it is
typically more useful to convert it to the frequency domain using an FFT. We
convert each audio file to a logMel spectrogram.

A spectrogram shows the features in a two-dimensional representation with the 
intensity of a frequency at a point in time. These representations will be discussed 
more in Chap. 8. A set of logMel spectrogram samples from the FSDD dataset are 
shown in Fig. 4.28. 


4.8.1 Software Tools and Libraries 


In these sections, we will use PyTorch for our example code. We find that the code 
used for PyTorch mixes effortlessly with Python, making it easier to focus on the 
deep learning concepts rather than the syntax associated with other frameworks. 
In addition to PyTorch, we also use librosa to perform the audio manipulation and 
augmentation. 


⁵ https://github.com/Jakobovski/free-spoken-digit-dataset.

Fig. 4.28: FSDD sample, showing logMel spectrograms for spoken digits 


4.8.2 Exploratory Data Analysis (EDA) 


The original FSDD dataset contains 1500 examples, with no dedicated validation
or testing set. This is a relatively small number of examples for deep learning,
so we enlarge the dataset using data augmentation. We focus on two types of
augmentation: time stretching and pitch shifting. Time stretching either
increases or decreases the length of the file, while pitch shifting moves the
frequencies higher or lower. For time stretching we make the file 25% faster or
slower, and for pitch shifting we shift up or down one half-step. Every
combination of these (including leaving the file unchanged) is applied to each
file, yielding 13,500 examples, a 9x increase in the amount of data.


import librosa

samples, sample_rate = librosa.load(file_path)
for ts in [0.75, 1, 1.25]:  # time-stretch rates: slower, unchanged, faster
    for ps in [-1, 0, +1]:  # pitch shifts in half-steps
        samples_new = librosa.effects.time_stretch(samples, rate=ts)
        y_new = librosa.effects.pitch_shift(samples_new, sr=sample_rate, n_steps=ps)


The neural networks described so far are only able to take fixed-length inputs.
The temporal nature of speech makes that difficult, as some of the files are
longer than others. To work around this constraint, we trim all files to a
maximum duration of 1.5 s (shorter files are padded to this length). This allows
us to work with a fixed representation for all files. This also helps when
batching, as all files in a batch should typically be the same length for
computational efficiency.

After increasing the total amount of data and limiting the length, we randomly 
split into training, validation, and testing sets. 80% of the data is used for training, 
10% for validation, and 10% for testing. 
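
A random 80/10/10 split can be sketched as follows. The helper name `split_dataset`, the fixed seed, and the use of integer indices in place of the actual file names are assumptions made for illustration.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the items and cut them into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


# 13,500 stands in for the augmented files; real code would pass file paths.
train, val, test = split_dataset(range(13500))
print(len(train), len(val), len(test))
```

One caveat worth keeping in mind: because the split is made after augmentation, augmented variants of the same recording can land in different sets. Splitting the original files first and augmenting afterwards avoids that overlap.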

We use librosa to obtain the logMel Spectrogram, with 128 mel filters applied 
(typically < 40 is still fine). 

import librosa
import numpy as np

max_length = 1.5  # Max length in seconds
samples, sample_rate = librosa.load(file_path)
# Pad or trim to exactly max_length seconds (the size must be an integer)
short_samples = librosa.util.fix_length(samples, size=int(sample_rate * max_length))
melSpectrum = librosa.feature.melspectrogram(
    y=short_samples.astype(np.float16), sr=sample_rate, n_mels=128)
logMelSpectrogram = librosa.power_to_db(melSpectrum, ref=np.max)


In addition to saving the audio files in raw wav format, we also save them as
numpy arrays. Loading numpy arrays is much faster during training, especially if
we are applying any augmentation. The input data will be the scaled pixel values
of the spectrograms. The dimensionality of the input will be d x t, where d is
the number of mel features extracted and t is the number of time steps. At
loading time, we normalize the logMel spectrogram to be between 0 and 1.
Converting the data from the power decibel range [-80, 0] to the continuous
range [0, 1] alleviates the need for the network to learn large weights in the
early stages of training. This typically makes training more stable, as there is
less internal covariate shift.
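
The rescaling itself is a linear map. The sketch below assumes the spectrogram is held as a nested list of decibel values and that values outside [-80, 0] should be clipped; the function name `normalize_db` is illustrative.

```python
def normalize_db(spectrogram_db, min_db=-80.0, max_db=0.0):
    """Map decibel values from [min_db, max_db] linearly onto [0, 1]."""
    scale = max_db - min_db
    return [
        [(min(max(value, min_db), max_db) - min_db) / scale for value in row]
        for row in spectrogram_db
    ]


example = [[-80.0, -40.0], [-20.0, 0.0]]
scaled = normalize_db(example)
print(scaled)  # [[0.0, 0.5], [0.75, 1.0]]
```

With numpy arrays the same map can be written as a single vectorized expression over the whole array.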

In theory, scaling and normalization are not strictly required in neural
networks. Any linear rescaling of the input can be absorbed by changing the
weights and bias associated with the input to achieve the same outcome. However,
some gradient descent methods are very sensitive to scaling, and standardizing
the input data reduces the need for the network to learn extreme values for
outliers. This typically improves training times, because it reduces the
dependency on the scale of the initial weights.

The next thing that we would like to look for in our data is whether there is a
class or dataset imbalance. If there is a substantial class imbalance, then we
want to ensure that we have a representative sample across our datasets.
Figure 4.29 shows a histogram for each of our dataset splits. From the
histograms, we can see that each class is well represented in each of our sets,
and that all classes are relatively balanced in the number of examples per
class. This is usually true for academic datasets, but is infrequently the case
in practice.

Now that we have a good representation of our data, we will show an example of 
a supervised classification problem with a neural network as well as an unsupervised 
learning method using an autoencoder. 


4.8.3 Supervised Learning 


A supervised classifier first requires us to define an error function that we optimize. 
We use the cross-entropy loss for our model with a softmax output. In practice, 
the log of the softmax is used to prevent underflow if the probabilities of one class 
become very low. 
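
The underflow problem can be reproduced with plain floating-point arithmetic. The sketch below compares a naive log of the softmax with the usual log-sum-exp formulation; the two-class logits are made up for illustration, and the function names are not a library API.

```python
import math


def naive_log_softmax(logits):
    """Take the softmax first and the log afterwards (numerically fragile)."""
    total = sum(math.exp(z) for z in logits)
    return [math.log(math.exp(z) / total) for z in logits]


def log_softmax(logits):
    """Compute log-softmax directly via the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


logits = [0.0, -1000.0]  # the second class has a vanishingly small probability

stable = log_softmax(logits)
print(stable)  # [0.0, -1000.0]

try:
    naive_log_softmax(logits)
    naive_failed = False
except ValueError:
    # exp(-1000) underflows to 0.0, and log(0.0) is undefined
    naive_failed = True
print(naive_failed)
```

The cross-entropy for a target class is then just the negative of the corresponding log-softmax entry, which stays finite in the stable version.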

The second step is to define our network architecture. The architecture is often 
obtained experimentally, considering computational resources, and representational 
power. In our example, we initially choose a small, 2-hidden layer network with 128 
neurons in each layer with a ReLU activation function after each hidden layer. This 
network is shown in Fig. 4.30. 

Fig. 4.29: Histograms for the FSDD training, validation, and testing sets. Each
example has a spoken label of 0-9. The distribution between the classes is
roughly consistent across the datasets 


The PyTorch network definition is shown below: 


import torch
import torch.nn as nn


# PyTorch Network Definition
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(3072, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view((-1, 3072))  # Converts 2D data to 1D
        h = self.fc1(x)
        h = torch.relu(h)

        h = self.fc2(h)
        h = torch.relu(h)

        h = self.fc3(h)
        out = torch.log_softmax(h, dim=1)
        return out


In the network definition, we only need to instantiate the learned layers, and the 
forward function then defines the order of computation that will be executed. 

Fig. 4.30: 3-layer neural network for FSDD classification. A ReLU layer is used
as the activation function after the first two hidden layers and a log-softmax
transformation after the output layer 


Linear layers expect the input to be represented in a 1-dimensional form. Thus
we include a call to the “view” function, which converts each 2-dimensional
input into 1 dimension.⁶

The gradients are paired with each learnable parameter; thus, for each step in
the forward pass, memory is reserved for the gradient at that step. After
passing data through our network we will have an output tensor of size [n, 10],
where n is the mini-batch size. We can then compute our loss using our error
metric, cross entropy. This function takes the predicted log-probabilities and
the target class labels and computes the scalar loss. The backward function on
the loss then computes the gradient of all parameters that contributed to the
loss, in reverse order, using backpropagation. Once the backward pass has been
performed, we call a single step of our optimizer, which updates the parameters
using the gradient (according to our learning rate and other hyperparameters).
We repeat this process over the entire dataset for e epochs. The Python code
for the training function is shown below.


import torch
import torch.nn.functional as F
import torch.optim as optim

use_cuda = torch.cuda.is_available()  # Run on GPU if available

# Neural Network Training in PyTorch
model = Model()
model.train()
if use_cuda:
    model.cuda()
optimizer = optim.Adam(model.parameters(), lr=0.01)
n_epoch = 40
for epoch in range(n_epoch):
    for data, target in train_loader:
        # Get Samples
        if use_cuda:
            data, target = data.cuda(), target.cuda()

        # Clear gradients
        optimizer.zero_grad()

        # Forward Propagation
        y_pred = model(data)

        # Error Computation (the model outputs log-probabilities, so the
        # negative log-likelihood is the cross-entropy loss)
        loss = F.nll_loss(y_pred, target)

        # Backpropagation
        loss.backward()

        # Parameter Update
        optimizer.step()


⁶ Note, PyTorch can still train in mini-batch mode. The view function converts
the input tensor into the dimensions [n, 3072], where n is the mini-batch size.


This code snippet is not complete because it does not incorporate validation eval- 
uation during the training process. A more robust example is given in the accompa- 
nying notebook. It is left to the reader to experiment with different hyper-parameter 
configurations in the exercises. During the training process, we save a copy of the 
model with the best validation loss. This model is used to compute the error on the 
test set. The training curves and test set results are shown in Fig. 4.31. 

We can additionally modify our network to include some of the regularization 
techniques and activation functions that we discussed previously, such as batch nor- 
malization, dropout, and ReLUs. Incorporating these features is a simple modifi- 
cation of the model architecture described previously. The training graph for this 
model is also given in Fig. 4.31. 

import torch
import torch.nn as nn
import torch.nn.functional as F


# PyTorch Network Definition
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(3072, 128)
        self.bc1 = nn.BatchNorm1d(128)

        self.fc2 = nn.Linear(128, 128)
        self.bc2 = nn.BatchNorm1d(128)

        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view((-1, 3072))
        h = self.fc1(x)
        h = self.bc1(h)
        h = torch.relu(h)
        h = F.dropout(h, p=0.5, training=self.training)  # Disabled during evaluation

        h = self.fc2(h)
        h = self.bc2(h)
        h = torch.relu(h)
        h = F.dropout(h, p=0.2, training=self.training)  # Disabled during evaluation

        h = self.fc3(h)
        out = torch.log_softmax(h, dim=1)
        return out


4.8.4 Unsupervised Learning 


For the unsupervised example, we will train a simple autoencoder on the FSDD
dataset. This autoencoder learns a low-dimensional encoding of the input data
from which the decoder is able to reconstruct examples. The architecture that we
will use in this example is shown in Fig. 4.32.

Because this is an unsupervised task, we will use the MSE error function com- 
paring our input with the output of our decoder. The output of our network must be 
the same size as our input, d = 3072, thus the final layer of our network must ensure 
that the dimensionality matches the input. 

The network architecture is a very simple definition, with four linear layers
each for the encoder and the decoder. The PyTorch autoencoder definition is
shown below.

import torch
import torch.nn as nn
import torch.nn.functional as F  # Functional forms of the non-linearities


# PyTorch Network Definition
class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()

        self.e_fc1 = nn.Linear(3072, 512)
        self.e_fc2 = nn.Linear(512, 128)
        self.e_fc3 = nn.Linear(128, 64)
        self.e_fc4 = nn.Linear(64, 64)

        self.d_fc1 = nn.Linear(64, 64)
        self.d_fc2 = nn.Linear(64, 128)
        self.d_fc3 = nn.Linear(128, 512)
        self.d_fc4 = nn.Linear(512, 3072)

    def forward(self, x):
        # Encoder
        h = F.relu(self.e_fc1(x))
        h = F.relu(self.e_fc2(h))
        h = F.relu(self.e_fc3(h))
        h = self.e_fc4(h)

        # Decoder
        h = F.relu(self.d_fc1(h))
        h = F.relu(self.d_fc2(h))
        h = F.relu(self.d_fc3(h))
        h = self.d_fc4(h)
        out = torch.tanh(h)

        return out


Fig. 4.31: Learning curve for a 40 epoch run with two different architecture
definitions. Notice the stability of the regularized architecture in (b)
compared to (a). (a) Learning curve for a 40 epoch run of the 2-hidden layer
network shown in Fig. 4.30. On the test set, the best performing validation
model achieves a loss of 2.3050 and an accuracy of 10%, statistically the same
as random guessing. (b) Learning curve for a 40 epoch run of the 2-hidden layer
network shown in Fig. 4.30 with the incorporation of batch normalization and
dropout. On the test set, the best performing validation model achieves a loss
of 0.0825 and an accuracy of 98% 

Fig. 4.32: Autoencoder for the FSDD dataset. Note: layer sizes define the output 
size of that layer 


The training algorithm is very similar to the one that was introduced for the 
classification example. We use the Adam optimizer and add a weight decay term 
for regularization. Additionally, as we will be using the same size input as output, 
we will move the 2D to 1D transformation outside of the model. The rest of the 
algorithm is the same as previously shown. The training algorithm is shown below. 


import torch.optim as optim
import torch.nn.functional as F

# Neural Network Training in PyTorch
model = autoencoder()
optimizer = optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5)

for epoch in range(n_epoch):
    for data, _ in train_loader:
        # Get samples
        input = data.view(-1, 3072)  # We will reuse the formatted input as our target

        # Forward Propagation
        output = model(input)

        # Error Computation
        loss = F.mse_loss(output, input)

        # Clear gradients
        optimizer.zero_grad()

        # Backpropagation
        loss.backward()

        # Parameter Update
        optimizer.step()

A sample of the decoded output of an input example is shown in Fig. 4.33. 


Fig. 4.33: Autoencoder output after n epoch(s) on the training data. Notice how
the horizontal lines in the spectrogram are starting to form differently for
separate inputs. (a) Autoencoder reconstruction of its input after 1 epoch.
(b) Autoencoder reconstruction of its input after 100 epochs 


When examining the reconstructed inputs, we notice that they appear to be less
sharp than the examples shown in Fig. 4.28. This is mainly due to the MSE loss
function. Because it computes the squared error, it tends to pull all values
toward the mean, prioritizing the average over specific areas of the input.
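
A small numerical illustration of this pull toward the mean, under the simplifying assumption that the model must output one value for a pixel whose true intensity varies across otherwise similar inputs (the target values below are made up):

```python
def mse(prediction, targets):
    """Mean squared error of a single predicted value against several targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# A pixel that is dark in three examples and bright in one
targets = [0.0, 0.0, 0.0, 1.0]
mean = sum(targets) / len(targets)

candidates = [0.0, 0.25, 0.5, 1.0]
errors = {c: mse(c, targets) for c in candidates}
best = min(candidates, key=lambda c: errors[c])
print(mean, best, errors)
```

Neither of the sharp answers (0.0 or 1.0) minimizes the error; the washed-out value at the mean does, which is consistent with the blurred reconstructions.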


4.8.5 Classifying with Unsupervised Features 


The RBM learns unsupervised features during the training process. Once these
unsupervised features are learned, we can use them to create a low-dimensional,
labeled dataset for a supervised classifier. In our example, we train an RBM
and then use the learned features as input to a logistic regression classifier.
We can define an RBM with the following code: 


import torch
import torch.nn as nn
import torch.nn.functional as F


class RBM(nn.Module):
    def __init__(self, n_vis=3072, n_hin=128, k=5):
        super(RBM, self).__init__()
        self.W = nn.Parameter(torch.randn(n_hin, n_vis) * 1e-2)
        self.v_bias = nn.Parameter(torch.zeros(n_vis))
        self.h_bias = nn.Parameter(torch.zeros(n_hin))
        self.k = k  # number of Gibbs sampling steps

    def sample_from_p(self, p):
        # Draw binary samples: 1 where p exceeds a uniform random number
        return torch.relu(torch.sign(p - torch.rand_like(p)))

    def v_to_h(self, v):
        p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias))
        sample_h = self.sample_from_p(p_h)
        return p_h, sample_h

    def h_to_v(self, h):
        p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))
        sample_v = self.sample_from_p(p_v)
        return p_v, sample_v

    def forward(self, v):
        pre_h1, h1 = self.v_to_h(v)

        h_ = h1
        for _ in range(self.k):
            pre_v_, v_ = self.h_to_v(h_)
            pre_h_, h_ = self.v_to_h(v_)

        return v, v_

    def free_energy(self, v):
        vbias_term = v.mv(self.v_bias)
        wx_b = F.linear(v, self.W, self.h_bias)
        hidden_term = wx_b.exp().add(1).log().sum(1)
        return (-hidden_term - vbias_term).mean()


We train the model with Adam. The sample code to do this is as follows: 


import torch.optim as optim

rbm = RBM(n_vis=3072, n_hin=128, k=1)

train_op = optim.Adam(rbm.parameters(), 0.01)
for epoch in range(epochs):
    loss_ = []
    for data, target in train_loader:
        data = data.view(-1, 3072)
        sample_data = data.bernoulli()  # binarize the inputs

        v, v1 = rbm(sample_data)
        loss = rbm.free_energy(v) - rbm.free_energy(v1)
        loss_.append(loss.item())
        train_op.zero_grad()
        loss.backward()
        train_op.step()





After training our RBM features, we can create a logistic regression classifier to 
classify our examples based on the unsupervised features we have learned. 


from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(train_features, train_labels)
predictions = clf.predict(test_features)


The classifier achieves an accuracy of 71.04% on the dataset using the
128-dimensional features from the RBM. A confusion matrix for the classifier is
shown in Fig. 4.34.
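
For readers who want to see how the raw and normalized confusion matrices and the accuracy relate to each other, here is a small sketch. The labels are a hypothetical 3-class toy example, not FSDD results, and the helper names are illustrative.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so that every non-empty row sums to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


def accuracy(matrix):
    """Fraction of examples on the diagonal, i.e., classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
cm_normalized = normalize_rows(cm)
acc = accuracy(cm)
print(cm, cm_normalized, acc)
```

The normalized matrix is the easier one to read when classes have different numbers of examples, since each row then shows per-class recall on its diagonal.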

Fig. 4.34: Confusion matrices for a logistic regression classifier with RBM features 
on the FSDD dataset. (a) Confusion matrix for FSDD. (b) Normalized confusion 
matrix for FSDD 


4.8.6 Results 


Combining the conclusions from previous sections, we compare the methods of 
classification in Table 4.1. 


Table 4.1: End-to-end speech recognition performance on FSDD test set.
Highlighted result indicates best performance 

Approach                             Accuracy (%)
2-layer MLP                          10.38
2-layer MLP (with regularization)    98.44
RBM + Logistic Regression            71.04


4.8.7 Exercises for Readers and Practitioners 


Some other interesting problems readers and practitioners can try on their own
include: 

1. What is the effect of training the FSDD classifier with each of the learning
rates [0.001, 0.1, 1.0, 10]? What is the effect when switching the optimization
method? 

2. What is the result of a learning rate of 0.1 for the FSDD autoencoder? 

3. How would the architecture change if we want to learn a set of sparse features 
instead of a low-dimensional encoding of the spoken digits? 

4. What effect does batch size have on the learning process? Does it affect the 
learning rate? 

5. What additional data augmentations could be applied to the audio for the system 
to be more robust? 

6. Train a classifier with the trained autoencoder’s encoding as the features. How 
does the accuracy compare to the supervised model? 

7. Change the autoencoder to a variational autoencoder. Does it improve the visible 
quality of the generated output? Vary the inputs to the decoder to understand the 
features that have been learned. 

8. Extend the RBM to create a deep belief network for classifying the FSDD
dataset. 


Chapter 5 
Distributed Representations 


5.1 Introduction 


In this chapter, we introduce the notion of word embeddings that serve as core 
representations of text in deep learning approaches. We start with the distributional 
hypothesis and explain how it can be leveraged to form semantic representations of 
words. We discuss the common distributional semantic models including word2vec 
and GloVe and their variants. We address the shortcomings of embedding models 
and their extension to document and concept representation. Finally, we discuss 
several applications to natural language processing tasks and present a case study 
focused on language modeling. 


5.2 Distributional Semantics 


Distributional semantics is a subfield of natural language processing predicated
on the idea that word meaning is derived from its usage. The distributional
hypothesis states that words used in similar contexts have similar meanings.
That is, if two words often occur with the same set of words, then they are
semantically similar. A broader notion is the statistical semantic hypothesis,
which states that meaning can be derived from statistical patterns of word
usage. Distributional semantics serves as the fundamental basis for many recent
computational linguistic advances.


5.2.1 Vector Space Model 


Vector space models (VSMs) represent a collection of documents as points in a
hyperspace, or equivalently, as vectors in a vector space (Fig. 5.1). They are
based on the key property that the proximity of points in the hyperspace is a
measure of the semantic similarity of the documents. In other words, documents
with similar vector representations are taken to be semantically similar. VSMs
have found widespread adoption in information retrieval applications, where a
search query is answered by returning a set of nearby documents sorted by
distance. We have already seen VSMs in the form of the bag-of-words
term-frequency or TFIDF example back in Chap. 3.
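
A minimal sketch of the retrieval idea, assuming raw term counts as the document vectors and cosine similarity as the proximity measure; the three documents and the query are made up for illustration.

```python
import math
from collections import Counter


def tf_vector(text, vocabulary):
    """Term-frequency vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stocks fell on monday",
]
vocabulary = sorted({word for doc in documents for word in doc.split()})
doc_vectors = [tf_vector(doc, vocabulary) for doc in documents]

query_vector = tf_vector("cat on the mat", vocabulary)
scores = [cosine(query_vector, vec) for vec in doc_vectors]

# Rank document indices from most to least similar to the query
ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranked, scores)
```

Note that every vector has one dimension per vocabulary word, and most entries are zero. Query words that are missing from the vocabulary are silently ignored by `tf_vector`.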





Fig. 5.1: Vector space model representation for documents 


5.2.1.1 Curse of Dimensionality 


VSMs can suffer from a major drawback if they are based on high-dimensional
sparse representations, where sparse means that a vector has many dimensions
with zero values. This is termed the curse of dimensionality. Such VSMs require
large memory resources and are computationally expensive to implement and use.
For instance, a term-frequency based VSM would theoretically require as many
dimensions as the number of words in the dictionary of the entire corpus of
documents. In practice, it is common to set an upper bound on the number of
words and hence the dimensionality of the VSM. Words that are not within the
VSM are termed out-of-vocabulary (OOV). This is a meaningful gap in most VSMs:
they are unable to attribute semantic meaning to new words that they have not
seen before and that are therefore OOV.


The distributional hypothesis says that the meaning of a word is derived from 
the context in which it is used, and words with similar meaning are used in 
similar contexts. 


5.2.2 Word Representations 


One of the earliest uses of word representations dates back to 1986. Word
vectors explicitly encode linguistic regularities and patterns. Distributional
semantic models can be divided into two classes: co-occurrence based and
predictive models. Co-occurrence based models must be trained over the entire
corpus and capture global dependencies and context, while predictive models
capture local dependencies within a (small) context window. The most well-known
of these models, word2vec and GloVe, are known as word models since they model
word dependencies across a corpus. Both learn high-quality, dense word
representations from large amounts of unstructured text data. These word vectors
are able to encode linguistic regularities and semantic patterns, which leads to
some interesting algebraic properties.


5.2.2.1 Co-occurrence 


The distributional hypothesis tells us that co-occurrence of words can reveal
much about their semantic proximity and meaning. Computational linguistics
leverages this fact and uses the frequency of two words occurring alongside
each other within a corpus to identify word relationships. Pointwise Mutual
Information (PMI) is a commonly used information-theoretic measure of
co-occurrence between two words $w_1$ and $w_2$:

\[ \mathrm{PMI}(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)} \tag{5.1} \]

where $p(w)$ is the probability of the word occurring, and $p(w_1, w_2)$ is the
joint probability of the two words co-occurring. High values of PMI indicate
collocation and coincidence (and therefore strong association) between the
words. It is common to estimate the single and joint probabilities based on word
frequency and co-occurrence within the corpus. PMI is a useful measure for word
clustering and many other tasks.
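
As a worked example of Eq. (5.1), the sketch below estimates the probabilities as the fraction of sentences that contain a word (or both words). That estimator, the natural logarithm, the helper name `make_pmi`, and the four toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def make_pmi(sentences):
    """Return a pmi(w1, w2) function estimated from sentence-level co-occurrence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.split())  # count each word once per sentence
        word_counts.update(tokens)
        pair_counts.update(frozenset(pair) for pair in combinations(sorted(tokens), 2))
    n = len(sentences)

    def pmi(w1, w2):
        joint = pair_counts[frozenset((w1, w2))] / n
        if joint == 0:
            return float("-inf")  # the words never co-occur
        return math.log(joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

    return pmi


sentences = ["new york city", "new york times", "new car", "the city"]
pmi = make_pmi(sentences)

# p(new) = 3/4, p(york) = 2/4, p(new, york) = 2/4  ->  log(4/3) > 0
print(pmi("new", "york"))
# p(new) = 3/4, p(city) = 2/4, p(new, city) = 1/4  ->  log(2/3) < 0
print(pmi("new", "city"))
```

A positive value indicates that the pair co-occurs more often than independence would predict, a negative value less often.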


5.2.2.2 LSA 


Latent semantic analysis (LSA) is a technique that effectively leverages word co- 
occurrence to identify topics within a set of documents. Specifically, LSA analyzes 
word associations within a set of documents by forming a document-term matrix 
(see Fig. 5.2), where each cell can be the frequency of occurrence or TFIDF of 
a term within a document. As this matrix can be very large (with as many rows 
as words in the vocabulary of the corpus), a dimensionality reduction technique 
such as singular-value decomposition is applied to find a low-rank approximation. 
This low-rank space can be used to identify key terms and cluster documents or for 
information retrieval (as discussed in Chap. 3). 

Fig. 5.2: LSA document-term matrix 


5.2.3 Neural Language Models 


Recall that language models seek to learn the joint probability function of
sequences of words. As stated above, this is difficult due to the curse of
dimensionality: the sheer size of the vocabulary used in the English language
implies that there could be an impossibly huge number of sequences over which
we seek to learn. A language model factorizes the joint probability of a
sequence $w_1, \ldots, w_T$ into the conditional probabilities of each word
$w_t$ given all previous words:

\[ P(w_1, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}) \tag{5.2} \]
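
The factorization can be checked on a toy example. The conditional probability table below is hypothetical and covers only a handful of histories; a real language model would supply `cond_prob` instead. Summing log-probabilities rather than multiplying probabilities is a common choice to avoid underflow on long sequences.

```python
import math


def sequence_log_prob(words, cond_prob):
    """Sum of log p(w_t | w_1, ..., w_{t-1}) over the sequence."""
    total = 0.0
    for t, word in enumerate(words):
        history = tuple(words[:t])
        total += math.log(cond_prob(word, history))
    return total


# Hypothetical conditional distributions, keyed by the full history
TABLE = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 1.0},
}


def toy_cond_prob(word, history):
    return TABLE[history][word]


log_p = sequence_log_prob(["the", "cat", "sat"], toy_cond_prob)
p = math.exp(log_p)  # 0.5 * 0.4 * 1.0
print(p)
```

The table already hints at the dimensionality problem: it needs one entry per distinct history, and the number of possible histories grows exponentially with sequence length.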


Many methods exist for estimating continuous representations of words, including 
latent semantic analysis (LSA) and latent Dirichlet allocation (LDA). The former 
fails to preserve linear linguistic regularities while the latter requires huge computa- 
tional expense for anything beyond small datasets. In recent years, different neural 
network approaches have been proposed to overcome these issues (Fig. 5.3), which 
we introduce below. The representations learned by these neural network models 
are termed neural embeddings or simply embeddings and will be referenced as 
such in the rest of this book. 


5.2.3.1 Bengio 


In 2003, Bengio et al. [Ben+03] presented a neural probabilistic model for learning 
a distributed representation of words. Instead of sparse, high-dimensional 
representations, the Bengio model proposed representing words and documents in 
lower-dimensional continuous vector spaces by using a multilayer neural network 
to predict the next word given the previous ones. This network is iteratively trained 
to maximize the conditional log-likelihood $J$ over the training corpus using 
backpropagation: 


$$J = \frac{1}{T} \sum_{t=1}^{T} \log f\big(v(w_t), v(w_{t-1}), \ldots, v(w_{t-n+1}); \theta\big) + R(\theta) \tag{5.3}$$


Fig. 5.3: Neural language model 


where $v(w_t)$ is the feature vector for word $w_t$, $f$ is the mapping function representing 
the neural network, and $R(\theta)$ is the regularization penalty applied to the weights $\theta$ 
of the network. In doing so, the model concurrently associates each word with a 
distributed word feature vector and learns the joint probability function of 
word sequences in terms of the feature vectors of the words in the sequence. For 
instance, given a corpus with a vocabulary of 100,000 words, in place of a one-hot encoded 
100,000-dimensional vector representation, the Bengio model can learn a much smaller 
300-dimensional continuous vector space representation (Fig. 5.4). 
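
The relationship between the sparse and the dense representation can be illustrated with a minimal sketch. It assumes toy sizes (a vocabulary of 10 words and 3 dimensions instead of 100,000 and 300) and a randomly initialized matrix `E` standing in for learned embeddings; all names are illustrative only.

```python
import numpy as np

# Toy sizes chosen for illustration; the text's example uses V = 100,000 and k = 300.
V, k = 10, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(V, k))        # embedding matrix, one row per vocabulary word

word_id = 7
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

dense_via_matmul = one_hot @ E     # sparse V-dim input projected to k dims
dense_via_lookup = E[word_id]      # the same vector, obtained by a row lookup

same = bool(np.allclose(dense_via_matmul, dense_via_lookup))
print(same, dense_via_lookup.shape)
```

Multiplying a one-hot vector by the embedding matrix simply selects one row, which is why embedding layers are usually implemented as table lookups rather than matrix products.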


5.2.3.2 Collobert and Weston 


In 2008, Collobert and Weston [CW08] applied word vectors to several NLP tasks 
and showed that word vectors could be trained in an unsupervised manner on a 
corpus and used to significantly enhance NLP tasks. They used a multilayer neu- 
ral network trained in an end-to-end fashion. In the process, the first layer in the 
network learned distributed word representations that are shared across tasks. The 
output of this word representation layer was passed to downstream architectures that 
were able to output part-of-speech tags, chunks, named entities, semantic roles, and 
sentence likelihood. The Collobert and Weston model is an example of multitask 
learning enabled through the adoption of dense layer representations. 

Fig. 5.4: Sparse vs. dense representations 


Neural language models can be trained by stochastic gradient descent and 
thereby avoid the heavy computational and memory burden of storing 
co-occurrence matrices. 


5.2.4 word2vec 


In 2013, Mikolov et al. [Mik+13b] proposed a set of neural architectures that could 
compute continuous representations of words over large datasets. Unlike other 
neural network architectures for learning word vectors, these architectures were highly 
computationally efficient, able to handle even billion-word corpora, since they 
do not involve dense matrix multiplications. Furthermore, the high-quality 
representations learned by these models possessed useful translational properties that 
provided semantic and syntactic meaning. The proposed architectures consisted of the 
continuous bag-of-words (CBOW) model and the skip-gram model. They termed 
the group of models word2vec. They also proposed two methods to train the models, 
based on either a hierarchical softmax approach or a negative-sampling approach. 

The translational properties of the vectors learned through word2vec models can 
provide highly useful linguistic and relational similarities. In particular, Mikolov et 
al. revealed that vector arithmetic can yield high-quality word similarities and analo- 
gies. They showed that the vector representation of the word queen can be recovered 
from representations of king, man, and woman by searching for the nearest vector 
based on cosine distance to the vector sum: 


v(queen) ≈ v(king) − v(man) + v(woman) 

Vector operations could reveal both semantic relationships such as: 


v(Rome) ≈ v(Paris) − v(France) + v(Italy) 
v(niece) ≈ v(nephew) − v(brother) + v(sister) 
v(Cu) ≈ v(Zn) − v(zinc) + v(copper) 


as well as syntactic relationships such as: 


v(biggest) ≈ v(smallest) − v(small) + v(big) 
v(thinking) ≈ v(reading) − v(read) + v(think) 
v(mice) ≈ v(dollars) − v(dollar) + v(mouse) 


In the next sections, we present the intuition behind the CBOW and skip-gram mod- 
els and their training methodologies. Notably, people have found that CBOW mod- 
els are better able to capture syntactic relationships, whereas skip-gram models ex- 
cel at encoding semantic relationships between words. 
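
The analogy arithmetic described above can be sketched in a few lines. The example below is only an illustration: the three-dimensional vectors are hand-made (they are not trained embeddings), and the helper names `cosine` and `analogy` are hypothetical.

```python
import numpy as np

# Hand-made 3-d vectors (royalty, male, female); purely illustrative, not trained.
vectors = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.1, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word nearest to v(a) - v(b) + v(c), excluding the query words."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the query words from the candidate set is a common convention in analogy evaluation, since the nearest neighbor of the combined vector is otherwise often one of the inputs.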


Note that word2vec models are fast: they can quickly learn vector 
representations from much larger corpora than previous methods. 


5.2.4.1 CBOW 


The CBOW architecture is based on a projection layer that is trained to predict a 
target word given a context window of c words to the left and right side of the target 
word (Fig. 5.5). The input layer maps each context word through an embedding 
matrix W to a dense vector representation of dimension k, and the resulting vectors 
of the context words are averaged across each dimension to yield a single vector of 
dimension k. The embedding matrix W is shared for all context words. Because the 
order of the context words is irrelevant in the averaging, this model is analogous to 
a bag-of-words model, except that a continuous representation is used. The CBOW 
model objective seeks to maximize the average log probability: 


$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_t \mid w_{t+j}) \tag{5.4}$$


where c is the number of context words to each side of the target word (Fig. 5.6). 
For the simple CBOW model, the average vector representation from the output of 
the projection layer is fed into a softmax that predicts over the entire vocabulary of 
the corpus, using backpropagation to maximize the log probability objective: 


$$p(w_t \mid w_{t+j}) = \frac{\exp\left(v_{w_t}^{\top} v_{w_{t+j}}\right)}{\sum_{i=1}^{V} \exp\left(v_{w_i}^{\top} v_{w_{t+j}}\right)} \tag{5.5}$$


where V is the number of words in the vocabulary. Note that after training, the matrix 
W contains the learned word embeddings of the model. 

Fig. 5.5: Continuous bag-of-words model (context window = 4) 

Fig. 5.6: CBOW vector construction (context window = 2) 


5.2.4.2 Skip-Gram 


Whereas the CBOW model is trained to predict a target word based on the nearby 
context words, the skip-gram model is trained to predict the nearby context words 
based on the target word (Fig. 5.7). Once again, word order is not considered. For 
a context size c, the skip-gram model is trained to predict the c words around the 
target word. The objective of the skip-gram model is to maximize the average log 
probability: 


$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \tag{5.6}$$


where c is the size of the training context (Fig. 5.8). Higher values of c result in 
more training examples and thus can lead to a higher accuracy, at the expense of 
training time. The simplest skip-gram formulation utilizes the softmax function: 


$$p(w_{t+j} \mid w_t) = \frac{\exp\left(v_{w_{t+j}}^{\top} v_{w_t}\right)}{\sum_{i=1}^{V} \exp\left(v_{w_i}^{\top} v_{w_t}\right)} \tag{5.7}$$


where V is the number of words in the vocabulary. 

Fig. 5.7: Skip-gram model (context window = 4) 
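
To make the effect of the context size c concrete, the following sketch enumerates the (target, context) training pairs that the skip-gram objective sums over. The function name `skipgram_pairs` and the toy sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, c):
    """Return (target, context) pairs for every offset j with -c <= j <= c, j != 0."""
    pairs = []
    for t, target in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the target itself and positions outside the sentence.
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((target, tokens[t + j]))
    return pairs

tokens = ["the", "sacred", "river", "ran"]
pairs_c1 = skipgram_pairs(tokens, c=1)
pairs_c2 = skipgram_pairs(tokens, c=2)
print(pairs_c1)
print(len(pairs_c1), len(pairs_c2))
```

Increasing c from 1 to 2 grows the number of pairs for the same sentence, which is the source of both the additional training signal and the additional training time.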


It is interesting to note that shorter training contexts result in vectors that 
capture syntactic relationships well, while larger context windows better capture 
semantic relationships. The intuition behind this is that syntactic information 
is typically dependent on the immediate context and word order, whereas 
semantic information can be non-local and require larger window sizes. 

Fig. 5.8: Skip-gram vector construction (context window = 2) 


5.2.4.3 Hierarchical Softmax 


The simple versions of CBOW and skip-gram use a full softmax output layer, which 
can be computationally expensive when the vocabulary is large. A more compu- 
tationally efficient approximation to the full softmax is the hierarchical softmax, 
which uses a binary tree representation of the output layer. Each word w can be 
reached by an appropriate path from the root of the tree: 





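$$p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\Big( [\![ n(w, j+1) = \mathrm{ch}(n(w, j)) ]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \Big) \tag{5.8}$$

where 

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{5.9}$$

$$\sum_{w=1}^{V} p(w \mid w_I) = 1 \tag{5.10}$$

Here $n(w, j)$ is the $j$-th node on the path from the root to $w$, $L(w)$ is the length of that 
path, $\mathrm{ch}(n)$ is an arbitrary fixed child of node $n$, $[\![x]\!]$ is 1 if $x$ is true and $-1$ otherwise, 
$v'_{n}$ is the vector of inner node $n$, and $v_{w_I}$ is the vector of the input word $w_I$. 

In practice, it is common to use a binary Huffman tree, which assigns short codes 
to the frequent words and results in fast training, as it requires evaluating only about 
$\log_2(V)$ nodes instead of $V$ words for the softmax. 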
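
The path product in Eq. (5.8) can be sketched for a tiny vocabulary. The example assumes a hand-built balanced tree over four hypothetical words (not a Huffman tree) and random vectors; it is meant only to show that each word needs one sigmoid per inner node on its path and that the resulting probabilities sum to one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical fixed tree over 4 words: inner node 0 is the root,
# nodes 1 and 2 are its left and right children. Each word's path lists
# (inner node, sign), with sign +1 for "go left" and -1 for "go right".
paths = {
    "w0": [(0, +1), (1, +1)],
    "w1": [(0, +1), (1, -1)],
    "w2": [(0, -1), (2, +1)],
    "w3": [(0, -1), (2, -1)],
}

rng = np.random.default_rng(0)
dim = 5
inner_vectors = rng.normal(size=(3, dim))   # one vector per inner node
h = rng.normal(size=dim)                    # vector of the input word

def path_probability(word):
    """Product of sigmoids along the root-to-leaf path of the word."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * (inner_vectors[node] @ h))
    return p

probs = {w: path_probability(w) for w in paths}
total = sum(probs.values())
print(probs, total)
```

Because sigmoid(x) + sigmoid(-x) = 1 at every inner node, the leaf probabilities form a valid distribution without ever computing a normalization over the whole vocabulary.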


5.2.4.4 Negative Sampling 


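Mikolov et al. [Mik+13b] proposed an even better alternative to the hierarchical 
softmax based on noise contrastive estimation (NCE). NCE is premised on the 
notion that a good model should be able to differentiate data from noise via 
logistic regression. Negative sampling is a simplification of NCE that seeks to separate 
true context words from randomly selected words by maximizing the modified log 
probability: 


$$\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right] \tag{5.11}$$

where $w_I$ is the input word, $w_O$ is the true context word, and the $k$ negative words $w_i$ 
are drawn from the noise distribution $P_n(w)$. 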


When choosing the number of negative samples k, note that word2vec’s 
performance will decrease as this number increases in most cases. In practice, k 
in the range of 5 to 20 can be used. 


The main difference between negative sampling and NCE is that NCE needs both 
samples and the numerical probabilities of the noise distribution, while negative 
sampling uses only samples. 
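
A minimal sketch of the negative-sampling objective of Eq. (5.11) for a single (input, context) pair follows. It assumes random toy vectors and made-up word counts; the noise distribution used here (unigram counts raised to the 3/4 power, as reported in the word2vec paper) is an assumption of this sketch, since the text above does not specify one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(v_in, v_out_true, v_out_negatives):
    """Eq. (5.11) with the expectation replaced by k drawn samples."""
    positive = np.log(sigmoid(v_out_true @ v_in))
    negative = np.sum(np.log(sigmoid(-(v_out_negatives @ v_in))))
    return positive + negative

rng = np.random.default_rng(0)
vocab_size, dim, k = 50, 8, 5
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # input vectors v
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))   # output vectors v'

# Assumed noise distribution: toy unigram counts raised to the 3/4 power.
counts = rng.integers(1, 100, size=vocab_size)
noise = counts ** 0.75
noise = noise / noise.sum()

target, context = 3, 17
negatives = rng.choice(vocab_size, size=k, p=noise)     # only samples are needed
objective = negative_sampling_objective(W_in[target], W_out[context], W_out[negatives])
print(objective)
```

Only k + 1 dot products are evaluated per training pair, regardless of the vocabulary size. This sketch does not filter out the case where a sampled negative happens to equal the true context word.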


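$$\sum_{t=1}^{T} \left[ \sum_{c \in C_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in N_{t,c}} \ell\big(-s(w_t, n)\big) \right] \tag{5.12}$$

$$s(w, c) = \sum_{g \in G_w} z_g^{\top} v_c \tag{5.13}$$

$$s(w_t, w_c) = u_{w_t}^{\top} v_{w_c} \tag{5.14}$$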


We have previously noted the need to remove stop words when using count-based 
methods, as these frequent words can occur at very high rates but convey little 
semantic information. When training word vectors, they can have a similar 
disproportionate effect. A common way to deal with this effect is to subsample frequent 
words. During training, each word $w_i$ is potentially discarded with probability: 


$$p(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \tag{5.15}$$


where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold. 
Subsampling can considerably speed up training times as well as increase the 
accuracy of the learned vectors of rare words. 
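
The subsampling rule of Eq. (5.15) can be sketched as below. The threshold value `t = 1e-5` and the relative frequencies are assumptions chosen for illustration, and the clipping at zero for words rarer than the threshold is a practical choice of this sketch rather than part of the equation.

```python
import math

def discard_probability(freq, t=1e-5):
    """Eq. (5.15): 1 - sqrt(t / f), clipped at 0 for words rarer than t."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

# Relative frequencies are made up for illustration.
p_frequent = discard_probability(0.05)   # a very frequent word
p_rare = discard_probability(1e-6)       # rarer than the threshold: never dropped
print(round(p_frequent, 4), p_rare)
```

A word that makes up 5% of the corpus is dropped almost every time it occurs, while words below the threshold are always kept.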


5.2.4.5 Phrase Representations 


Previous word representations are limited by their inability to satisfy 
compositionality; that is, they cannot infer the meaning of phrases from the individual words. 
Many phrases have a meaning that is not a simple composition of the meanings of 
their individual words. For example, the meaning of New England Patriots is not the 
sum of the meanings of each word. 

To deal with this, one approach is to represent phrases by replacing the words 
with a single token (e.g., New_England_Patriots). This process can be automated 
using a scoring mechanism: 


$$\mathrm{score}(w_i, w_j) = \log \frac{\mathrm{count}(w_i, w_j)}{\mathrm{count}(w_i)\,\mathrm{count}(w_j)} \tag{5.16}$$


such that words are combined and replaced by a single token whenever the score 
rises above a threshold value. This equation is an approximation to the pointwise- 
mutual information. 
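
The scoring step can be sketched on a toy token stream. The corpus and the function name `phrase_scores` are made up for illustration; the score follows Eq. (5.16) using raw unigram and bigram counts.

```python
import math
from collections import Counter

def phrase_scores(tokens):
    """Score each adjacent word pair with Eq. (5.16)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: math.log(n / (unigrams[pair[0]] * unigrams[pair[1]]))
        for pair, n in bigrams.items()
    }

text = ("the new england team and the new england fans "
        "and the team and the fans like new england")
scores = phrase_scores(text.split())
score_phrase = scores[("new", "england")]   # words that always occur together
score_loose = scores[("the", "new")]        # "the" also precedes other words
print(round(score_phrase, 3), round(score_loose, 3))
```

Pairs whose words occur mostly together score higher than pairs involving a word that appears in many other contexts. With raw counts, pairs of very rare words can also receive high scores; the word2vec paper subtracts a discounting coefficient from the bigram count for this reason.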

Interestingly, word2vec models, and in particular the skip-gram model, exhibit 
vector compositionality: simple vector addition can 
often produce meaningful phrases. For example, adding the vectors for Philadelphia and Eagles 
can yield a vector that is in closest proximity to other sports teams. 


5.2.4.6 word2vec CBOW: Forward and Backward Propagation 


We will derive equations for forward and backward propagation for CBOW to give 
the readers insight into the training mechanisms and how the weights are updated. 
Let a single input word be represented as a one-hot vector $x \in \mathbb{R}^{V}$, where 
$V$ is the vocabulary size, and let $C$ such word vectors $\{x_1, x_2, \ldots, x_C\}$ 
form the context. Let the vectors flow into a single hidden layer $h \in \mathbb{R}^{D}$, where $D$ 
is the dimension of the embeddings to be learned through training, with the identity 
activation function, and let the input values be averaged across the context words. Let 
$W \in \mathbb{R}^{V \times D}$ be the weight matrix that captures weights between the input and the hidden 
layer. Figure 5.9 shows the different layers and the connections as described above. 
The hidden layer can be given as: 


$$h = W^{\top} \left( \frac{1}{C} \sum_{c=1}^{C} x_c \right) \tag{5.17}$$


We will represent $\frac{1}{C} \sum_{c=1}^{C} x_c$ as the average input vector $\bar{x}$. Thus: 


$$h = W^{\top} \bar{x} \tag{5.18}$$

The hidden layer $h \in \mathbb{R}^{D}$ is mapped to a single output layer $u \in \mathbb{R}^{V}$ with weights 
$W' \in \mathbb{R}^{D \times V}$. This is given by: 


$$u = W'^{\top} h \tag{5.19}$$

$$u = W'^{\top} W^{\top} \bar{x} \tag{5.20}$$


Fig. 5.9: word2vec CBOW with one-hot encoded inputs, single hidden layer, output 
layer and a softmax layer 


The output layer is then mapped to a softmax output layer $y \in \mathbb{R}^{V}$ given by: 

$$y = \mathrm{softmax}(u) = \mathrm{softmax}(W'^{\top} W^{\top} \bar{x}) \tag{5.21}$$


When we are training the model with a target word ($w_i$) and context words ($w_{c,1}, w_{c,2}, \ldots, w_{c,C}$), 
the output value should match the target in the one-hot encoded representation, i.e., 
at position $j^*$ the output has value 1 and 0 elsewhere. The loss in terms of the conditional 
probability of the target word given the context words is given by: 


$$\mathcal{L} = -\log P(w_i \mid w_{c,1}, w_{c,2}, \ldots, w_{c,C}) = -\log(y_{j^*}) = -\log(\mathrm{softmax}(u)_{j^*}) \tag{5.22}$$

$$\mathcal{L} = -\log(y_{j^*}) = -\log \left( \frac{\exp(u_{j^*})}{\sum_{i=1}^{V} \exp(u_i)} \right) \tag{5.23}$$

$$\mathcal{L} = -u_{j^*} + \log \sum_{i=1}^{V} \exp(u_i) \tag{5.24}$$


The idea of training through gradient descent as discussed in Chap. 4 is to find values 
of $W$ and $W'$ that minimize the loss function given by Eq. (5.24). The loss function 
depends on $W$ and $W'$ through the output variable $u$. So to find the values we 
differentiate the loss function $\mathcal{L}$ with respect to both $W$ and $W'$. Since $\mathcal{L} = \mathcal{L}(u(W, W'))$, 
the two derivatives can be written as: 


$$\frac{\partial \mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^{V} \frac{\partial \mathcal{L}}{\partial u_k} \frac{\partial u_k}{\partial W'_{ij}} \tag{5.25}$$

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^{V} \frac{\partial \mathcal{L}}{\partial u_k} \frac{\partial u_k}{\partial W_{ij}} \tag{5.26}$$


Let us consider Eq. (5.25), where $W'_{ij}$ is the connection between hidden unit $i$ and 
output unit $j$. Since $W'_{ij}$ affects only the output $u_j$, the derivative $\partial u_k / \partial W'_{ij}$ 
is nonzero only at $k = j$ and is 0 in all other places. Thus the equation reduces to: 


$$\frac{\partial \mathcal{L}}{\partial W'_{ij}} = \frac{\partial \mathcal{L}}{\partial u_j} \frac{\partial u_j}{\partial W'_{ij}} \tag{5.27}$$

Now $\partial \mathcal{L} / \partial u_j$ can be written as: 

$$\frac{\partial \mathcal{L}}{\partial u_j} = -\delta_{jj^*} + y_j = e_j \tag{5.28}$$

where $\delta_{jj^*}$ is the Kronecker delta, whose value is 1 if $j = j^*$ and 0 elsewhere. 
This can be represented in vector form as $e \in \mathbb{R}^{V}$. 

The other term $\partial u_j / \partial W'_{ij}$ can be written in terms of $W_{ki}$ and the average input vector $\bar{x}_k$ as: 

$$\frac{\partial u_j}{\partial W'_{ij}} = \sum_{k=1}^{V} W_{ki}\, \bar{x}_k \tag{5.29}$$

Thus combining: 

$$\frac{\partial \mathcal{L}}{\partial W'_{ij}} = (-\delta_{jj^*} + y_j) \left( \sum_{k=1}^{V} W_{ki}\, \bar{x}_k \right) \tag{5.30}$$

This can be written as: 

$$\frac{\partial \mathcal{L}}{\partial W'} = (W^{\top} \bar{x}) \otimes e \tag{5.31}$$


Next, $u$ written in expanded form becomes: 


$$u_k = \sum_{d=1}^{D} \sum_{l=1}^{V} W'_{dk} \left( \frac{1}{C} \sum_{c=1}^{C} W_{ld}\, x_{lc} \right) \tag{5.32}$$


For Eq. (5.26), after we fix the input, the output $y_j$ at node $j$ depends on all the 
connections from that input and thus 


$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^{V} \frac{\partial \mathcal{L}}{\partial u_k} \frac{\partial}{\partial W_{ij}} \left( \frac{1}{C} \sum_{d=1}^{D} \sum_{l=1}^{V} W'_{dk} \sum_{c=1}^{C} W_{ld}\, x_{lc} \right) \tag{5.33}$$

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \frac{1}{C} \sum_{k=1}^{V} \sum_{c=1}^{C} (-\delta_{kk^*} + y_k)\, W'_{jk}\, x_{ic} \tag{5.34}$$

This can be written as: 

$$\frac{\partial \mathcal{L}}{\partial W} = \bar{x} \otimes (W' e) \tag{5.35}$$


Thus the new values $W_{\text{new}}$ and $W'_{\text{new}}$ using a learning rate $\eta$ are given by: 


$$W_{\text{new}} = W_{\text{old}} - \eta \frac{\partial \mathcal{L}}{\partial W} \tag{5.36}$$

and 

$$W'_{\text{new}} = W'_{\text{old}} - \eta \frac{\partial \mathcal{L}}{\partial W'} \tag{5.37}$$
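
The derivation above can be checked numerically. The following NumPy sketch implements the forward pass of Eqs. (5.17) to (5.24), the gradients in the outer-product form of Eqs. (5.31) and (5.35), and one update step of Eqs. (5.36) and (5.37). The toy sizes, the random initialization, the word indices, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def cbow_loss_and_grads(W, W_out, context_ids, target_id):
    """Forward pass (5.17)-(5.24) and gradients (5.31), (5.35) for one example."""
    V = W.shape[0]
    x_bar = np.zeros(V)
    for c in context_ids:                  # average of the one-hot context vectors
        x_bar[c] += 1.0 / len(context_ids)
    h = W.T @ x_bar                        # hidden layer, shape (D,)
    u = W_out.T @ h                        # output scores, shape (V,)
    y = np.exp(u - u.max())
    y /= y.sum()                           # softmax
    loss = -np.log(y[target_id])
    e = y.copy()
    e[target_id] -= 1.0                    # e_j = y_j - delta_{j j*}
    grad_W_out = np.outer(h, e)            # (W^T x_bar) outer e, shape (D, V)
    grad_W = np.outer(x_bar, W_out @ e)    # x_bar outer (W' e), shape (V, D)
    return loss, grad_W, grad_W_out

rng = np.random.default_rng(0)
V, D = 6, 3                                # toy sizes, chosen arbitrarily
W = rng.normal(scale=0.5, size=(V, D))
W_out = rng.normal(scale=0.5, size=(D, V))
context_ids, target_id = [0, 2, 4, 5], 3

loss, grad_W, grad_W_out = cbow_loss_and_grads(W, W_out, context_ids, target_id)

# Finite-difference check of one entry of each gradient.
eps = 1e-6
W_shift = W.copy()
W_shift[2, 1] += eps
num_grad_W = (cbow_loss_and_grads(W_shift, W_out, context_ids, target_id)[0] - loss) / eps
W_out_shift = W_out.copy()
W_out_shift[1, 3] += eps
num_grad_W_out = (cbow_loss_and_grads(W, W_out_shift, context_ids, target_id)[0] - loss) / eps

# One gradient descent step, Eqs. (5.36)-(5.37), with learning rate eta.
eta = 0.1
W_new = W - eta * grad_W
W_out_new = W_out - eta * grad_W_out
new_loss = cbow_loss_and_grads(W_new, W_out_new, context_ids, target_id)[0]
print(loss, new_loss)
```

Only the rows of W that belong to the context words receive a nonzero gradient, because the averaged input vector is zero everywhere else.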


5.2.4.7 word2vec Skip-gram: Forward and Backward Propagation 


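As defined earlier, the skip-gram model is the inverse of CBOW, i.e., the 
center word is given as the input and the context words are predicted at the output, as 
shown in Fig. 5.10. We will derive the skip-gram equations along the same lines using 
a simple network similar to the CBOW network above. 

Fig. 5.10: word2vec skip-gram with input word, single hidden layer, generating C 
context words as output that maps to the softmax function generating the one-hot 
representation for each 

The input $x \in \mathbb{R}^{V}$ goes into the hidden layer $h \in \mathbb{R}^{D}$ through the weights 
$W \in \mathbb{R}^{V \times D}$ and the identity activation function. The hidden layer then generates $C$ context word 
vectors $u_c \in \mathbb{R}^{V}$ as the output, each of which is mapped through a softmax function to 
generate an output $y_c \in \mathbb{R}^{V}$ that corresponds to a word in the vocabulary. 


$$h = W^{\top} x \tag{5.38}$$

$$u_c = W'^{\top} h \tag{5.39}$$

$$u_c = W'^{\top} W^{\top} x, \quad c = 1, \ldots, C \tag{5.40}$$

$$y_c = \mathrm{softmax}(u_c) = \mathrm{softmax}(W'^{\top} W^{\top} x), \quad c = 1, \ldots, C \tag{5.41}$$


The loss function for skip-gram can be written as: 

$$\mathcal{L} = -\log P(w_{\mathrm{c},1}, w_{\mathrm{c},2}, \ldots, w_{\mathrm{c},C} \mid w_i) \tag{5.42}$$

where the word $w_i$ is the input word and $w_{\mathrm{c},1}, w_{\mathrm{c},2}, \ldots, w_{\mathrm{c},C}$ are the output context 
words (the upright subscript c marks a context word, while the italic $c$ indexes the context position). 

$$\mathcal{L} = -\log \prod_{c=1}^{C} P(w_{\mathrm{c},c} \mid w_i) \tag{5.43}$$

Similar to CBOW, this can be further written as: 

$$\mathcal{L} = -\log \prod_{c=1}^{C} \frac{\exp(u_{c, j_c^*})}{\sum_{j=1}^{V} \exp(u_{c,j})} \tag{5.44}$$

$$\mathcal{L} = -\sum_{c=1}^{C} u_{c, j_c^*} + \sum_{c=1}^{C} \log \sum_{j=1}^{V} \exp(u_{c,j}) \tag{5.45}$$

where $j_c^*$ is the vocabulary index of the $c$-th context word. 
The loss function is dependent on each $u_c$, and each $u_c$ is dependent on $(W, W')$. This can be 
expressed as: 

$$\mathcal{L} = \mathcal{L}\big(u_1(W, W'), \ldots, u_C(W, W')\big) \tag{5.46}$$

$$\mathcal{L} = \mathcal{L}\big(u_{1,1}(W, W'), \ldots, u_{C,V}(W, W')\big) \tag{5.47}$$


The chain rule applied to skip-gram gives: 

$$\frac{\partial \mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^{V} \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial u_{c,k}} \frac{\partial u_{c,k}}{\partial W'_{ij}} \tag{5.48}$$

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^{V} \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial u_{c,k}} \frac{\partial u_{c,k}}{\partial W_{ij}} \tag{5.49}$$

Similar to CBOW we can write: 

$$\frac{\partial \mathcal{L}}{\partial W'_{ij}} = \sum_{k=1}^{V} \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial u_{c,k}} \frac{\partial u_{c,k}}{\partial W'_{ij}} = \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial u_{c,j}} \frac{\partial u_{c,j}}{\partial W'_{ij}} \tag{5.50}$$

$$\frac{\partial \mathcal{L}}{\partial W'_{ij}} = \sum_{c=1}^{C} (-\delta_{jj_c^*} + y_{c,j}) \left( \sum_{k=1}^{V} W_{ki}\, x_k \right) \tag{5.51}$$

where $\frac{\partial \mathcal{L}}{\partial u_{c,j}} = -\delta_{jj_c^*} + y_{c,j} = e_{c,j}$. 
This can be simplified as: 

$$\frac{\partial \mathcal{L}}{\partial W'} = (W^{\top} x) \otimes \sum_{c=1}^{C} e_c \tag{5.52}$$

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^{V} \sum_{c=1}^{C} \frac{\partial \mathcal{L}}{\partial u_{c,k}} \frac{\partial}{\partial W_{ij}} \left( \sum_{d=1}^{D} \sum_{l=1}^{V} W'_{dk}\, W_{ld}\, x_l \right) \tag{5.53}$$

$$\frac{\partial \mathcal{L}}{\partial W_{ij}} = \sum_{k=1}^{V} \sum_{c=1}^{C} (-\delta_{kk_c^*} + y_{c,k})\, W'_{jk}\, x_i \tag{5.54}$$


Now, representing $-\delta_{kk_c^*} + y_{c,k} = e_{c,k}$, we can simplify it as 

$$\frac{\partial \mathcal{L}}{\partial W} = x \otimes \left( W' \sum_{c=1}^{C} e_c \right) \tag{5.55}$$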
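
As with CBOW, the skip-gram gradients can be verified with a small numerical sketch. It assumes toy sizes, random weights, and arbitrary word indices, and it uses the fact visible in Eq. (5.40) that all C output slots share the same scores, so one softmax can be reused for every context position.

```python
import numpy as np

def skipgram_loss_and_grads(W, W_out, center_id, context_ids):
    """Loss (5.45) and gradients (5.52), (5.55) for one center word."""
    V = W.shape[0]
    x = np.zeros(V)
    x[center_id] = 1.0                     # one-hot input word
    h = W.T @ x                            # Eq. (5.38)
    u = W_out.T @ h                        # identical scores for every context slot
    y = np.exp(u - u.max())
    y /= y.sum()                           # softmax shared by all C outputs
    loss = -sum(np.log(y[j]) for j in context_ids)
    e_sum = len(context_ids) * y           # sum over c of y_c ...
    for j in context_ids:
        e_sum[j] -= 1.0                    # ... minus the one-hot targets
    grad_W_out = np.outer(h, e_sum)        # Eq. (5.52), shape (D, V)
    grad_W = np.outer(x, W_out @ e_sum)    # Eq. (5.55), shape (V, D)
    return loss, grad_W, grad_W_out

rng = np.random.default_rng(0)
V, D = 6, 3                                # toy sizes, chosen arbitrarily
W = rng.normal(scale=0.5, size=(V, D))
W_out = rng.normal(scale=0.5, size=(D, V))
center_id, context_ids = 2, [0, 1, 3, 4]

loss, grad_W, grad_W_out = skipgram_loss_and_grads(W, W_out, center_id, context_ids)

# Finite-difference check on one weight of the center word's embedding.
eps = 1e-6
W_shift = W.copy()
W_shift[center_id, 0] += eps
num_grad = (skipgram_loss_and_grads(W_shift, W_out, center_id, context_ids)[0] - loss) / eps
print(num_grad, grad_W[center_id, 0])
```

Only the row of W belonging to the center word is updated, whereas every column of W' receives a gradient through the full softmax. This cost is what hierarchical softmax and negative sampling are designed to avoid.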


5.2.5 GloVe 


Global co-occurrence-based models are an alternative to predictive, local-context 
window methods like word2vec. Co-occurrence methods are usually very 
high dimensional and require much storage. When dimensionality reduction 
methods are used, as in LSA, the resulting representations typically perform poorly in 
capturing semantic word regularities. Furthermore, frequent co-occurrence terms 
tend to dominate. Predictive methods like word2vec are local-context based and 
generally perform poorly in capturing the statistics of the corpus. In 2014, 
Pennington et al. [PSM14] proposed a log-bilinear model that combines both global 
co-occurrence and shallow window methods. They termed this the GloVe model, 
which is a play on the words Global and Vector. The GloVe model is trained via least 
squares using the cost function: 

$$J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( u_i^{\top} v_j - \log X_{ij} \right)^2 \tag{5.56}$$


where $V$ is the size of the vocabulary, $X_{ij}$ is the number of times that words $i$ and $j$ 
co-occur in the corpus (Fig. 5.11), $f$ is a weighting function that acts to reduce the 
impact of frequent counts, and $u_i$ and $v_j$ are word vectors. 


        where  in  the  sacred  river  ran  man  to  sunlit  sea 
where       0   2    1       0      0    0    1   2       0    0 
in          2   0    0       1      0    0    1   0       1    0 
the         1   0    0       4      3    1    5   0       2    1 
sacred      0   1    4       0      2    0    1   0       0    1 
river       0   0    3       2      0    3    0   1       0    0 
ran         0   0    1       0      3    0    3   3       0    0 
man         1   1    5       1      0    3    0   1       0    2 
to          2   0    0       0      1    3    1   0       1    0 
sunlit      0   1    2       0      0    0    0   1       0    2 
sea         0   0    1       1      0    0    2   0       2    0 


Fig. 5.11: GloVe co-occurrence matrix (context window = 3) 


Typically, a clipped power-law form is assumed for the weighting function $f$: 


$$f(X_{ij}) = \begin{cases} \left( X_{ij} / X_{\max} \right)^{\alpha} & \text{if } X_{ij} < X_{\max} \\ 1 & \text{otherwise} \end{cases} \tag{5.57}$$


with $X_{\max}$ set at training time based on the corpus and $\alpha$ a fixed exponent. Note that the model trains 
context vectors U and word vectors V separately, and GloVe embeddings are 
given by the sum of these two vector representations U + V. Similar to word2vec, 
GloVe embeddings can express semantic and syntactic relationships through 
vector addition and subtraction [SL14]. Furthermore, word embeddings generated by 
GloVe have been reported to outperform word2vec on many NLP tasks, especially in 
situations where global context is important, such as named entity recognition. 
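
The weighting function and the cost of Eqs. (5.56) and (5.57) can be sketched as follows. The defaults `x_max = 100` and `alpha = 0.75` are assumptions taken from commonly reported GloVe settings, and the small count matrix and random vectors are illustrative only.

```python
import numpy as np

def glove_weight(X, x_max=100.0, alpha=0.75):
    """Clipped power-law weighting of Eq. (5.57)."""
    return np.where(X < x_max, (X / x_max) ** alpha, 1.0)

def glove_cost(U, Vmat, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares cost of Eq. (5.56), summed over nonzero counts.

    Zero counts are skipped: their weight is 0 and log(0) is undefined.
    """
    i, j = np.nonzero(X)
    diff = np.sum(U[i] * Vmat[j], axis=1) - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j], x_max, alpha) * diff ** 2))

# Toy 3-word co-occurrence counts and random 4-d vectors.
X = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, 4))       # context vectors
Vmat = rng.normal(scale=0.1, size=(3, 4))    # word vectors
cost = glove_cost(U, Vmat, X)
embeddings = U + Vmat                        # final embeddings as the sum of both sets
print(cost, embeddings.shape)
```

Counts at or above `x_max` all receive weight 1, so very frequent pairs cannot dominate the cost, while rare pairs are down-weighted smoothly.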


GloVe outperforms word2vec when the corpus is small or where insufficient 
data may be available to capture local context dependencies. 


5.2.6 Spectral Word Embeddings 


Spectral approaches based on eigen-decomposition are another family of methods to 
generate dense word embeddings. One of these, canonical correlation analysis, has 
shown significant potential. This method overcomes many shortcomings of previous 
methods: it is scale invariant and provides better sample complexity for 
rare words. 

Canonical correlation analysis (CCA) is analogous to principal component 
analysis (PCA) for pairs of matrices. Whereas PCA calculates the directions of 
maximum variance within a single matrix, CCA calculates the directions of 
maximum correlation between two matrices. CCA exhibits desirable properties for use 
in learning word embeddings in that it is invariant to linear transformations 
and provides better sample complexity. 

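The CCA model learns embeddings by first computing the dominant canonical 
correlations between target words and a context of c words nearby [DFU11]. The 
goal is to find vectors $\phi_w$ and $\phi_c$ so that linear combinations are maximally 
correlated: 

$$\max_{\phi_w, \phi_c} \frac{\phi_w^{\top} C_{wc}\, \phi_c}{\sqrt{\phi_w^{\top} C_{ww}\, \phi_w}\, \sqrt{\phi_c^{\top} C_{cc}\, \phi_c}} \tag{5.58}$$

Here $C_{wc}$ denotes the cross-covariance between the word and context views, and $C_{ww}$ and 
$C_{cc}$ denote the covariances within each view. 
Similar to LSA, this is accomplished by applying SVD to a scaled co-occurrence 
matrix of counts of words with their context. Thus, the optimization objective can 
be cast as: 


$$\max_{g_{\phi_w},\, g_{\phi_c}} g_{\phi_w}^{\top} D_{wc}\, g_{\phi_c} \tag{5.59}$$

where 

$$g_{\phi_w}^{\top} g_{\phi_w} = I \tag{5.60}$$

$$g_{\phi_c}^{\top} g_{\phi_c} = I \tag{5.61}$$

$$D_{wc} = \Lambda_w^{-1/2} V_w^{\top} C_{wc} V_c \Lambda_c^{-1/2} \tag{5.62}$$

with $V_w \Lambda_w V_w^{\top}$ and $V_c \Lambda_c V_c^{\top}$ denoting the eigendecompositions of $C_{ww}$ and $C_{cc}$. 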
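
The SVD-based solution can be sketched on synthetic data. The example below is an illustration under stated assumptions: the two "views" are random matrices that share one latent signal, a small ridge term `reg` is added for numerical stability, and the whitening uses the symmetric inverse square root of each covariance, which yields the same singular values as the scaled matrix in Eq. (5.62).

```python
import numpy as np

def cca(Xw, Xc, reg=1e-8):
    """Canonical correlations and projections via SVD of the whitened cross-covariance."""
    Xw = Xw - Xw.mean(axis=0)
    Xc = Xc - Xc.mean(axis=0)
    n = Xw.shape[0]
    Cww = Xw.T @ Xw / n + reg * np.eye(Xw.shape[1])
    Ccc = Xc.T @ Xc / n + reg * np.eye(Xc.shape[1])
    Cwc = Xw.T @ Xc / n

    def inv_sqrt(C):
        vals, vecs = np.linalg.eigh(C)          # C = V diag(vals) V^T
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Kw, Kc = inv_sqrt(Cww), inv_sqrt(Ccc)
    G_w, corrs, G_c_T = np.linalg.svd(Kw @ Cwc @ Kc)
    return corrs, Kw @ G_w, Kc @ G_c_T.T        # correlations, phi_w and phi_c as columns

# Synthetic views: both contain a noisy copy of the same latent signal z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 1))
Xw = np.hstack([z + 0.1 * rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
Xc = np.hstack([rng.normal(size=(n, 1)), z + 0.1 * rng.normal(size=(n, 1))])

corrs, phi_w, phi_c = cca(Xw, Xc)
a = (Xw - Xw.mean(axis=0)) @ phi_w[:, 0]        # first canonical variate, word view
b = (Xc - Xc.mean(axis=0)) @ phi_c[:, 0]        # first canonical variate, context view
first_corr = np.corrcoef(a, b)[0, 1]
print(corrs, first_corr)
```

The leading singular value equals the correlation between the two projected views, which is the quantity maximized in Eq. (5.58).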


An eigenword dictionary is created from which word embeddings are extracted. 
By using explicit left and right contexts, CCA possesses a “multi-view” capability 
that can allow it to implicitly account for word order, in contrast to word2vec 
or GloVe. This “multi-view” capability can be leveraged to induce context-specific 
embeddings that can significantly improve certain NLP tasks. This is especially true 
if a mixture of short and long contexts is applied, which can capture both short- 
and long-range dependencies as necessary in NLP tasks such as word sense 
disambiguation or entailment. 


5.2.7 Multilingual Word Embeddings 


It is well known that the distributional hypothesis holds for most human lan- 
guages. This implies that we can train word embedding models in many languages 
[Cou+16, RVS17], and companies such as Facebook and Google have released 
pre-trained word2vec and GloVe vectors for up to 157 languages [Gra+18]. These 
embedding models are monolingual; that is, they are learned on a single language. 
Several languages exist with multiple written forms. For instance, Japanese possesses 
three distinct writing systems (Hiragana, Katakana, Kanji). Monolingual embedding 
models cannot associate the meaning of a word across different written forms. 
The term word alignment is used to describe the NLP process by which words are 
related to one another across two written forms or across languages (translational 
relationships) (Fig. 5.12) [Amm+16]. Embedding models have provided a path for deep 
learning to make major breakthroughs in word alignment tasks, as we will learn in 
Chap. 6 and beyond. 


Fig. 5.12: Word alignment 


5.3 Limitations of Word Embeddings 


Embedding models suffer from a number of well-known limitations. These include 
out-of-vocabulary words, antonymy, polysemy, and bias. We explore these in detail 
in the next sections. 


5.3.1 Out of Vocabulary 


The Zipfian distributional nature of the English language is such that there exists 
a huge number of infrequent words. Learning representations for these rare words 
would require huge amounts of (possibly unavailable) data, as well as potentially 
excessive training time or memory resources. Due to practical considerations, a 
word embedding model will contain only a limited set of the words in the English 
language. Even a large vocabulary will still have many out-of-vocabulary 
(OOV) words. Unfortunately, many important domain-specific terms tend to occur 
infrequently and can contribute to the number of OOV words. This is especially true 
with domain shifts. As a result, OOV words can play a crucial role in the performance 
of NLP tasks. 

With models such as word2vec, the common approach is to use an “UNK” 
representation for words deemed too infrequent to include in the vocabulary. This maps 
many rare words to an identical vector (a zero or random vector) in the belief that 
their rarity implies they do not contribute significantly to semantic meaning. Thus, 
OOV words all provide an identical context during training. Similarly, OOV words 
at test time are mapped to this representation. This assumption can break down for 
many reasons, and a number of methods have been proposed to address this shortfall. 
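
The UNK mapping can be sketched as a vocabulary builder with a frequency cutoff. The function names, the `min_count` value, and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def build_vocab(tokens, min_count=2, unk="UNK"):
    """Map words seen at least min_count times to ids; id 0 is reserved for UNK."""
    counts = Counter(tokens)
    vocab = {unk: 0}
    for word, n in counts.most_common():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab, unk="UNK"):
    """Replace every out-of-vocabulary token with the id of UNK."""
    return [vocab.get(t, vocab[unk]) for t in tokens]

train = "the river ran to the sea and the sunlit sea".split()
vocab = build_vocab(train)
ids = encode("the sacred river ran to the sea".split(), vocab)
print(vocab, ids)
```

In this toy run, both the rare training words (river, ran, to) and the unseen test word (sacred) collapse onto the same id, which illustrates how much information the UNK convention can discard.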

Ideally, we would like to be able to predict a vector representation that 
is semantically appropriate for words that are either outside our training corpus or that 
occurred too infrequently in our corpus. Character-based or subword (char-n-gram) 
embedding models are compositional approaches that attempt to derive a meaning 
from parts of a word (e.g., roots, suffixes) [Lin+15, LM16, Kim+16]. Subword 
approaches are especially useful for languages that are rich in morphology, 
such as Arabic or Icelandic [CJF16]. Byte-pair encoding is a character-based, 
bottom-up method that iteratively groups frequent character pairs and subsequently 
learns embeddings on the final groups [KB16]. Other methods that leverage 
external knowledge bases (e.g., WordNet) have also been explored, including the copy 
mechanism that takes into account word position and alignment, but these tend to be less 
resilient to shifts in domain [Gu+16, BCB14]. 
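
The iterative grouping performed by byte-pair encoding can be sketched as follows. This is a simplified illustration, not a reference implementation: the word counts are made up, there is no end-of-word marker, and ties between equally frequent pairs are broken by taking the lexicographically largest pair so that the result is deterministic.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn merge rules by repeatedly joining the most frequent adjacent symbol pair."""
    vocab = {tuple(word): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, n in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = {}
        for symbols, n in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] = merged_vocab.get(tuple(out), 0) + n
        vocab = merged_vocab
    return merges, vocab

merges, segmented = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)
print(segmented)
```

After three merges the frequent suffix "est" has become a single symbol shared by "newest" and "widest", so an unseen word such as "lowest" could still be represented from known pieces.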


5.3.2 Antonymy 


Another significant limitation is an offshoot of the fundamental principle of 
distributional similarity from which word models are derived: that words used in similar 
contexts are similar in meaning. Unfortunately, two words that are antonyms of 
each other often co-occur with the same sets of word contexts: 


I really hate spaghetti on Wednesdays. 
I really love spaghetti on Wednesdays. 


While word embedding models can capture synonyms and semantic relationships, 
they fail notably to distinguish antonyms and overall polarity of words. In 
other words, without intervention, word embedding models cannot differentiate 
between synonyms and antonyms, and it is common to find antonyms closely 
co-located within a vector-space model. 

An adaptation to word2vec can be made to learn word embeddings that 
disambiguate polarity by incorporating thesauri information [OMS15]. Consider the 
skip-gram model that optimizes for an objective function: 


$$J(\theta) = \sum_{w \in V} \sum_{c \in V} \Big\{ \#(w, c) \log \sigma(\mathrm{sim}(w, c)) + k\, \#(w)\, P_0(c) \log \sigma(-\mathrm{sim}(w, c)) \Big\} \tag{5.63}$$


where the first term covers the co-occurrence pairs within a context window and the 
second term represents negative sampling. Given a set of synonyms $S_w$ and antonyms 
$A_w$ of a word $w$, we can modify the skip-gram model objective function to the form: 


$$J(\theta) = \sum_{w \in V} \sum_{s \in S_w} \log \sigma(\mathrm{sim}(w, s)) + \alpha \sum_{w \in V} \sum_{a \in A_w} \log \sigma(-\mathrm{sim}(w, a)) + \sum_{w \in V} \sum_{c \in V} \Big\{ \#(w, c) \log \sigma(\mathrm{sim}(w, c)) + k\, \#(w)\, P_0(c) \log \sigma(-\mathrm{sim}(w, c)) \Big\} \tag{5.64}$$

where $\alpha$ weights the antonym term. 


This objective can be optimized to learn embeddings that can distinguish synonyms 
from antonyms. Studies have shown that embeddings learned in this manner to in- 
corporate both distributional and thesauri information perform significantly better 
in tasks such as question-answering. 


5.3.3 Polysemy 


In the English language, words can sometimes have several meanings. This is known 
as polysemy. Sometimes these meanings can be very different or complete oppo- 
sites of each other. Look up the meaning of the word bad and you might find up to 
46 distinct meanings. As models such as word2vec or GloVe associate each word 
with a single vector representation, they are unable to deal with homonyms and pol- 
ysemy. Word sense disambiguation is possible but requires more complex models. 

In linguistics, word sense relates to the notion that, in the English language and 
many other languages, words can take on more than one meaning. Polysemy is the 
concept that a word can have multiple meanings. Homonymy is a related concept 
where two words are spelled the same but have different meanings. For instance, 
compare the usage of the word play in the sentences below: 


She enjoyed the play very much. 
She likes to play cards. 
She made a play for the promotion. 


For NLP applications to differentiate between the meanings of a polysemous word, 
separate representations must be learned for the same word, each 
associated with a particular meaning [Nee+14]. This is not possible with word2vec or 
GloVe embedding models since they learn a single embedding for a word. Embedding 
models must be extended in order to properly handle word sense. 




Humans do remarkably well in distinguishing the meaning of a word based on 
context. In the sentences above, it is relatively easy for us to distinguish the dif- 
ferent meanings of the word play based on the part-of-speech or surrounding word 
context. This gives rise to multi-representation embedding models that can lever- 
age surrounding context (cluster-weighted context embeddings) or part-of-speech 
(sense2vec). We briefly discuss each in the following sections, including other 
model variants. 


5.3.3.1 Clustering-Weighted Context Embeddings 


One approach to deal with word sense disambiguation is to start by building an 
inventory of senses for words within a corpus. Each instance of a word $w_i$ is associated 
with a representation based on context words surrounding it. These representations, 
termed context embeddings, are then clustered together. The centroids of the clusters 
form the set of representations $S_{w_i}$ for the different senses of the word: 


$$\mathrm{sense}(w_i) = \operatorname*{arg\,min}_{j:\, s_j \in S_{w_i}} d(c_i, s_j) \tag{5.65}$$

where $c_i$ is the context embedding of the instance and $d$ is a distance metric (usually cosine distance). This can be implemented 
as the multi-sense skip-gram model (Fig. 5.13) where each word is associated with 
a vector $v$ with context vectors $c$ and each sense of the word is associated with a 
representation $\mu$. Given a target word, a word sense is predicted based on $v_{\text{context}}$: 


$$s_t = \operatorname*{arg\,max}_{k = 1, 2, \ldots, K} \mathrm{sim}\big(\mu(w_t, k), v_{\text{context}}(c_t)\big) \tag{5.66}$$

where sim(a, b) is a similarity function. The multi-sense word embeddings are 
learned from a training set by maximizing the objective function: 


$$J(\theta) = \sum_{(w_t, c_t) \in D^{+}} \sum_{c \in c_t} \log P\big(D = 1 \mid v_s(w_t, s_t), v_g(c)\big) + \sum_{(w_t, c'_t) \in D^{-}} \sum_{c' \in c'_t} \log P\big(D = 0 \mid v_s(w_t, s_t), v_g(c')\big) \tag{5.67}$$
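
The sense-selection step of Eq. (5.66) can be sketched with hand-made vectors. Everything below is illustrative: the two-dimensional word vectors and the two cluster centers for the word play are invented, and the context vector is taken to be the plain average of the context word vectors.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_sense(context_vectors, sense_centers):
    """Pick the sense whose cluster center is most similar to the averaged context."""
    v_context = np.mean(context_vectors, axis=0)
    sims = [cosine_similarity(mu, v_context) for mu in sense_centers]
    return int(np.argmax(sims))

# Made-up 2-d vectors: axis 0 ~ "theater", axis 1 ~ "games".
word_vectors = {
    "stage": np.array([0.9, 0.1]), "actors": np.array([0.8, 0.2]),
    "cards": np.array([0.1, 0.9]), "poker": np.array([0.2, 0.8]),
}
play_centers = np.array([[1.0, 0.0],    # sense 0: a theatrical play
                         [0.0, 1.0]])   # sense 1: to play a game

sense_a = predict_sense([word_vectors["stage"], word_vectors["actors"]], play_centers)
sense_b = predict_sense([word_vectors["cards"], word_vectors["poker"]], play_centers)
print(sense_a, sense_b)
```

The same surface word is routed to different sense vectors depending only on its surrounding context.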


5.3.3.2 Sense2vec 


Multi-sense word embedding models are more computationally expensive to train 
and apply than single-sense models [CP18]. Sense2vec is a simpler method 
to achieve word-sense disambiguation that leverages supervised labeling such as 
part-of-speech [TML15]. It is an efficient method that eliminates the need for 
clustering during training as seen in context embeddings. For instance, the meanings of 
the word plant are distinct based on its use as a verb or noun: 


verb: He planted the tree. 
noun: He watered the plant. 

The sense2vec model can learn different word senses of this word by combining 
a single-sense embedding model with POS labels (Fig. 5.14). Given a corpus, 
sense2vec creates a new corpus in which each word is concatenated with its POS label, 
so that each sense of a word becomes a separate token. The CBOW 
or skip-gram model of word2vec is then trained on the new corpus to create word 
embeddings that incorporate word sense (as it relates 
to POS usage). Sense2vec has been shown to be effective for many NLP tasks 
beyond word-sense disambiguation (Fig. 5.15). 


Fig. 5.13: Cluster-weighted context embeddings 
(Panels: context vectors, context cluster centers, word sense vectors) 

Fig. 5.14: Sense2vec with POS supervised labeling 
(The word bank is split into bank_verb, bank_noun, and bank_pnoun) 

Fig. 5.15: Sense2vec 
(Example sentence "He banks at the bank" with tagged tokens (He, PRON), (at, ADP), (the, DET), (bank, NOUN)) 

5.3.4 Biased Embeddings 


Recently, we have become aware of the potential biases that may implicitly exist 
within embedding models. Learned word representations are only as good as the 
data that they were trained on—that is, they will capture the semantic and syntac- 
tic context inherent in the training data. For instance, recent studies have revealed 
racial and gender biases within popular word embedding models such as GloVe and 
word2vec trained on a broad news corpus: 


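$$
\begin{aligned}
v(\text{nurse}) &\approx v(\text{doctor}) - v(\text{father}) + v(\text{mother}) \\
v(\text{Leroy}) &\approx v(\text{Brad}) - v(\text{happy}) + v(\text{angry})
\end{aligned}
$$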


5.3.5 Other Limitations 


A further limitation of word embedding models relates to the batch nature of train- 
ing and the practicality of augmenting an existing model with new data or ex- 
panded vocabulary. Doing so requires us to retrain an embedding model with both 
the original data and new data—the entire data needs to be available and embed- 
dings recomputed. An online learning approach to word embeddings would allow 
them to be more practical. 


5.4 Beyond Word Embeddings 


Recent interest in word embedding models has led to practical adaptations that can
leverage word compositionality (subword embeddings) and address memory constraints
(Word2Bits). Others have extended word2vec to learn distributed representations of
sentences and documents (DM and DBOW) and of concepts (RDF2Vec). Interest has also
given rise to Bayesian approaches that map words to latent probability densities
(Gaussian embeddings) as well as to hyperbolic space (Poincaré embeddings). We
examine these innovations in the next sections.


5.4.1 Subword Embeddings 


Methods such as word2vec or GloVe ignore the internal structure of words and
associate each word (or word sense) with a separate vector representation. For
morphologically rich languages, there may be a significant number of rare word
forms, such that either a very large vocabulary must be maintained or a significant
number of words are treated as out-of-vocabulary (OOV). As previously stated,
out-of-vocabulary words can significantly impact performance due to the loss of
context from rare words [Bak18]. An approach that can help deal with this limitation
is the use of subword embeddings [Boj+16], where vector representations $z_g$ are
associated with character n-grams $g$, and each word $w_i$ is represented by the sum
of the vectors of its n-grams $G_w$ (Fig. 5.16):

$$ w_i = \sum_{g \in G_w} z_g \tag{5.68} $$

For instance, the vector for the word indict consists of the sum of the vectors for the
n-grams {ind, ndi, dic, ict, indi, ndic, dict, indic, ndict, indict} when $3 \le n \le 6$.
Thus, the set of n-grams is a superset of the vocabulary of the corpus (Fig. 5.17). As
n-grams are shared across words, this allows for the representation of even unseen
words, since an OOV word will still consist of n-grams that have representations.
Subword embeddings can significantly boost performance on NLP tasks such as language
modeling and text classification.
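
The n-gram decomposition and the sum in Eq. (5.68) can be illustrated with a short sketch. The helper names `char_ngrams` and `word_vector`, the 4-dimensional vectors, and the random placeholder values are assumptions made for this example only; a trained model would supply the actual n-gram vectors.

```python
import numpy as np


def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word` with n_min <= n <= n_max."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of the n-grams of `word` that have a representation."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g in ngram_vectors:
            vec += ngram_vectors[g]
    return vec


dim = 4
rng = np.random.default_rng(0)
# Toy n-gram table built from a single known word; values are random placeholders.
ngram_vectors = {g: rng.normal(size=dim) for g in char_ngrams("indict")}

print(char_ngrams("indict"))
# The unseen word "edict" still receives a vector through the shared
# n-grams "dic", "ict", and "dict".
oov_vec = word_vector("edict", ngram_vectors, dim)
```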


5.4.2 Word Vector Quantization 


Even for a small vocabulary, word models can require a significant amount of mem- 
ory and storage. Consider a 150,000-word vocabulary. A 300-dimensional continu- 
ous 64-bit representation of these words can easily occupy over 360 megabytes. It is 
possible to learn a compact representation by applying quantization to word vectors. 
In some cases, compression ratios of 8x-16x are possible relative to full-precision 
word vectors while maintaining comparable performance [Lam18]. Furthermore, 
the quantization function can act as a regularizer that can improve generalization 
[Lam18]. 

Word2Bits is an approach that adapts the word2vec CBOW method by introducing a
quantization element into its loss function:

$$
J_{quantized}\left(u_o^{(q)}, \hat{v}_c^{(q)}\right) = -\log\left(\sigma\left((u_o^{(q)})^{T} \hat{v}_c^{(q)}\right)\right)
- \sum_{i=1}^{k} \log\left(\sigma\left((-u_i^{(q)})^{T} \hat{v}_c^{(q)}\right)\right) \tag{5.69}
$$

where

$$ u_o^{(q)} = Q_{bitlevel}(u_o) \tag{5.70} $$

$$ \hat{v}_c^{(q)} = \sum_{-w+o \le i \le w+o,\; i \ne o} Q_{bitlevel}(v_i) \tag{5.71} $$

Here, $w$ is the context window width, $Q_{bitlevel}$ is the quantization function,
$u_o$ and $\hat{v}_c$ are the target and context word vectors, respectively,
$u_o^{(q)}$ and $\hat{v}_c^{(q)}$ are their quantized equivalents, and the sum in
Eq. (5.69) runs over the $k$ negative samples. The Heaviside step function is commonly
chosen as the quantization function $Q_{bitlevel}$. Similar to the standard CBOW
algorithm, the loss function is optimized over the target $u_o$ and context $v_i$ over
the corpus. The gradient updates for the target word $u_o$, negative sampling word
$u_i$, and context word $v_i$ (with $v_i^{(q)} = Q_{bitlevel}(v_i)$) are given by:

$$ \frac{\partial J_{quantized}(u_o^{(q)}, \hat{v}_c^{(q)})}{\partial u_o} = \frac{\partial J_{quantized}(u_o^{(q)}, \hat{v}_c^{(q)})}{\partial u_o^{(q)}} \tag{5.72} $$

$$ \frac{\partial J_{quantized}(u_o^{(q)}, \hat{v}_c^{(q)})}{\partial u_i} = \frac{\partial J_{quantized}(u_o^{(q)}, \hat{v}_c^{(q)})}{\partial u_i^{(q)}} \tag{5.73} $$

$$ \frac{\partial J_{quantized}(u_o^{(q)}, \hat{v}_c^{(q)})}{\partial v_i} = \frac{\partial J_{quantized}(u_o^{(q)}, \hat{v}_c^{(q)})}{\partial v_i^{(q)}} \tag{5.74} $$

Fig. 5.16: Word and subword vectors

Fig. 5.17: Sub-word embeddings (character n-grams with n = 1,2,3)

The final vector for each word is expressed as $Q_{bitlevel}(u_j + v_j)$, whose elements
can take on one of $2^{bitlevel}$ values and require only bitlevel bits each, in
comparison with the 32 or 64 bits of a full-precision representation. Studies have
shown that quantized vectors can perform comparably on word similarity tasks and
question answering tasks even with 16x compression.
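
The storage arithmetic and a 1-bit quantization step can be sketched as below. The function `quantize_1bit` and its output magnitude of 1/3 are assumptions made for illustration (any fixed magnitude shows the same idea); the vocabulary size and dimensionality are the ones used in the text.

```python
import numpy as np


def quantize_1bit(x, level=1.0 / 3.0):
    """Map each element to +level or -level according to its sign."""
    return np.where(x >= 0, level, -level)


vocab_size, dim = 150_000, 300
full_precision_mb = vocab_size * dim * 64 / 8 / 1e6   # 64 bits per element
one_bit_mb = vocab_size * dim * 1 / 8 / 1e6           # 1 bit per element
print(full_precision_mb, one_bit_mb)                  # 360.0 5.625

rng = np.random.default_rng(0)
u = rng.normal(scale=0.1, size=8)   # toy full-precision vector
u_q = quantize_1bit(u)              # every element is now +1/3 or -1/3
packed = np.packbits(u_q > 0)       # the 8 signs fit into a single byte
```

Only the sign of each element has to be stored, which is why eight dimensions pack into one byte here.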


5.4.3 Sentence Embeddings 


While word embedding models capture semantic relationships between words, they
lose this ability at the sentence level. Sentence representations are usually expressed
as the sum of the word vectors of the sentence. This bag-of-words approach has a
major flaw in that different sentences have identical representations as long as the
same words are used. To incorporate word order information, bag-of-n-grams
approaches have been used to capture short-range word order. However, at the sentence
level, they are limited by data sparsity and suffer from poor generalization due to
high dimensionality.
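
The flaw of the summed bag-of-words representation is easy to demonstrate. The sketch below uses random placeholder vectors for three words (an assumption for illustration) and shows that two sentences with opposite meanings receive the same representation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy word vectors with random placeholder values.
vectors = {w: rng.normal(size=5) for w in ["dog", "bites", "man"]}


def sum_embedding(sentence):
    """Represent a sentence as the sum of its word vectors."""
    return np.sum([vectors[w] for w in sentence.split()], axis=0)


a = sum_embedding("dog bites man")
b = sum_embedding("man bites dog")
print(np.allclose(a, b))  # True: word order is lost
```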

Le and Mikolov in 2014 [LM14] proposed an unsupervised algorithm to learn
useful representations of sentences that capture word order information. Their
approach was inspired by word2vec for learning word vectors and is commonly
known as doc2vec. It generates fixed-length feature representations from
variable-length pieces of text, making it applicable to sentences, paragraphs,
sections, or entire documents. The key to the approach is to associate every paragraph
with a unique paragraph vector $u^i$, which is summed with the word vectors $w^i_j$ of
the $J$ words in the paragraph to yield a representation $p^i$ of the paragraph:

$$ p^{i} = u^{i} + \sum_{j=1}^{J} w^{i}_{j} \tag{5.75} $$

Note that the term paragraph can refer to a sentence or document as well. This
approach is termed the distributed memory model (DM) (Fig. 5.18). The paragraph
vector $u^i$ can be thought of as acting as a memory that remembers word order context.
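
A single forward pass of the distributed memory idea can be sketched as follows. This is an illustrative toy, not the reference doc2vec implementation: the matrices `W`, `U`, and `O`, their sizes, and the function name `dm_predict` are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 8

W = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors, shared across paragraphs
U = rng.normal(scale=0.1, size=(2, dim))            # one paragraph vector per paragraph
O = rng.normal(scale=0.1, size=(dim, len(vocab)))   # hypothetical output (softmax) weights


def dm_predict(paragraph_id, context_ids):
    """Combine the paragraph vector with context word vectors and score every word."""
    p = U[paragraph_id] + W[context_ids].sum(axis=0)   # Eq. (5.75) restricted to the window
    scores = p @ O
    exp = np.exp(scores - scores.max())                # numerically stable softmax
    return exp / exp.sum()


# Probability of each vocabulary word following the context "the cat sat" in paragraph 0.
probs = dm_predict(paragraph_id=0, context_ids=[0, 1, 2])
```

During training, the prediction error would be backpropagated into `W`, `O`, and the row of `U` belonging to the current paragraph only.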

During training, a sliding window of context words C and the paragraph vector
$u^i$ are used to predict the next word in the paragraph context. Both paragraph
vectors and word vectors are trained via backpropagation. While the paragraph vector
is unique to each paragraph and shared across all contexts generated from the same
paragraph, the word vectors are shared across the entire corpus. It is notable that the
DM architecture resembles the CBOW architecture of word2vec, except with the
added paragraph context vector.

Fig. 5.18: Distributed memory architecture for paragraph vectors

Le and Mikolov also presented an architecture they called distributed bag-of-words
(DBOW), which uses only the paragraph context vector to predict the words in the
paragraph (Fig. 5.19). This simple model is analogous to the skip-gram version of
word2vec, except that the paragraph vector is used to predict all the words in the
paragraph instead of the target word being used to predict the context words. As in
the skip-gram model, DBOW is very computationally and memory efficient. Empirical
results have shown that both DM and DBOW outperform bag-of-words and
bag-of-n-gram models for text representations. Furthermore, averaging the DM and
DBOW vector representations often yields the best performance overall.

Fig. 5.19: Distributed bag-of-words architecture for paragraph vectors


5.4.4 Concept Embeddings 


A key characteristic of embedding models is their ability to capture semantic
relationships using simple vector arithmetic. Leveraging this idea, embedding models
have recently been developed to map ontological concepts into a vector space
[Als+18]. These embeddings can reflect the entity types, semantics, and relationships
of a knowledge graph. RDF2Vec is an approach for learning embeddings of entities in
knowledge graphs (Fig. 5.20). An RDF statement has three constituent parts: a subject,
a predicate, and an object. A collection of these statements can be used to build a
knowledge graph. RDF2Vec converts RDF graphs into a set of sequences using graph
walks/subtree graph kernels and then applies the word2vec algorithm to map entities
to latent representations. In the resulting embedding space, entities that share a
background concept are clustered close to each other, such that entities such as
“New York” are close to entities such as “city.”

Fig. 5.20: Knowledge graph

TransE was proposed as a general method that specifically represents
relationships between entities as translations in an embedding space. The key notion
is that, given a set of relationships in the form (head, label, tail), the vector of the
tail entity should be close to the vector of the head entity plus the vector of the label:


$$ v_{tail} \approx v_{head} + v_{label} \tag{5.76} $$


TransE is trained in a manner similar to negative sampling, by minimizing the loss
function over a set of triplets $S$ and a set of corrupted triplets $S'$ using
stochastic gradient descent:

$$ J(\theta) = \sum_{(h,l,t) \in S} \; \sum_{(h',l,t') \in S'} \max\left(d(h + l, t) - d(h' + l, t'), 0\right) + R(\theta) \tag{5.77} $$

where $d$ is a dissimilarity measure and $R$ is a regularizer (typically the $L_2$ norm).
Figure 5.21 illustrates vector translations as relationships from the knowledge graph
in Fig. 5.20 embedded by the TransE method. For instance, the following translations
hold:

$$
\begin{aligned}
v(\text{teller}) &\approx v(\text{doctor}) - v(\text{Jill}) + v(\text{Jack}) \\
v(\text{Jill}) &\approx v(\text{Jack}) - v(\text{Dover}) + v(\text{Erie})
\end{aligned}
$$

Fig. 5.21: Relationships mapped to vector translations by TransE method

With the ability to scale to large datasets, TransE and related methods [Bor+13] are
useful for both relation extraction and link prediction as well as NLP tasks.


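
The pairwise term of the TransE loss can be sketched for a single true triplet and a single corrupted triplet. The function name `transe_loss`, the Euclidean choice for the dissimilarity d, and the `margin` parameter are assumptions of this example; with `margin=0.0` the function matches the form of Eq. (5.77), and the regularizer is omitted for brevity.

```python
import numpy as np


def transe_loss(h, l, t, h_neg, t_neg, margin=0.0):
    """Hinge loss for one true triplet (h, l, t) and one corrupted triplet."""
    d_pos = np.linalg.norm(h + l - t)          # dissimilarity of the true triplet
    d_neg = np.linalg.norm(h_neg + l - t_neg)  # dissimilarity of the corrupted triplet
    return max(margin + d_pos - d_neg, 0.0)


h = np.array([1.0, 0.0])
l = np.array([0.0, 1.0])
t = np.array([1.0, 1.0])        # exactly h + l, so d_pos = 0
t_bad = np.array([-1.0, -1.0])  # corrupted tail far from h + l

loss_good = transe_loss(h, l, t, h, t_bad)   # true triplet already fits: loss is 0
loss_bad = transe_loss(h, l, t_bad, h, t)    # roles swapped: positive loss
print(loss_good, loss_bad)
```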
5.4.5 Retrofitting with Semantic Lexicons 


To take advantage of relational information contained in lexical databases such
as WordNet or FrameNet, Faruqui et al. [Far+14] proposed a method to refine
word embeddings such that lexically linked words have similar vector
representations. This refinement method, commonly called retrofitting with semantic
lexicons, makes no assumptions about how these vector representations are learned and
is applicable across spectral and neural approaches. Given a vocabulary of words
$(w_1, w_2, \ldots, w_n)$, a set of semantic relations expressed as an undirected graph
with edges $(w_i, w_j)$, and a set of learned word vectors $\hat{q}_i$ for each $w_i$,
the goal is to infer a new set of word vectors $q_i$ such that they are close in
distance to their counterparts $\hat{q}_i$ and to the vectors $q_j$ of adjacent vertices
$w_j$. With a Euclidean distance measure, this is equivalent to minimizing the
objective function:


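$$ \Psi(Q) = \sum_{i=1}^{n} \left[ \alpha_i \|q_i - \hat{q}_i\|^2 + \sum_{(i,j) \in E} \beta_{ij} \|q_i - q_j\|^2 \right] \tag{5.78} $$

where $\alpha_i$ and $\beta_{ij}$ reflect the relative strength of associations,
$\hat{q}_i$ is the originally learned vector for word $w_i$, and $q_i$ is its
retrofitted counterpart. Retrofitting can be accomplished iteratively with the
following update:

$$ q_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} q_j + \alpha_i \hat{q}_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i} \tag{5.79} $$

Retrofitting has led to substantial improvements in many lexical semantic evaluation
tasks and is useful where external knowledge can be leveraged.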
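
The iterative update of Eq. (5.79) can be sketched in a few lines. This is a simplified illustration: the function name `retrofit` is hypothetical, and a single shared `alpha` and `beta` are assumed for all words and edges, whereas the general formulation allows per-word and per-edge weights.

```python
import numpy as np


def retrofit(q_hat, edges, alpha=1.0, beta=1.0, n_iters=10):
    """Iteratively apply the update of Eq. (5.79) to every word that has neighbours."""
    q = q_hat.copy()
    neighbours = {i: [] for i in range(len(q_hat))}
    for i, j in edges:               # undirected graph: register both directions
        neighbours[i].append(j)
        neighbours[j].append(i)
    for _ in range(n_iters):
        for i, nbrs in neighbours.items():
            if not nbrs:
                continue             # word absent from the lexicon: keep its vector
            q[i] = (beta * q[nbrs].sum(axis=0) + alpha * q_hat[i]) / (beta * len(nbrs) + alpha)
    return q


# Three toy word vectors; only words 0 and 1 are linked in the lexicon.
q_hat = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
q = retrofit(q_hat, edges=[(0, 1)])
```

The two linked words are pulled toward each other while staying anchored to their original vectors, and the unlinked word is left unchanged.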


5.4.6 Gaussian Embeddings 


Rather than assuming that embedding models map words to point vectors in la- 
tent representation space, words can be mapped to continuous probability densities. 
This gives rise to several interesting advantages, as they can inherently capture un- 
certainty and asymmetry of semantic relationships between words. 


5.4.6.1 Word2Gauss 


One such approach is Word2Gauss, which maps each word w in the vocabulary
D and context word c in the dictionary C to a Gaussian distribution over a latent
embedding space. The vectors of this space are termed word types, and the words
observed are word instances. Word2Gauss presents two methods for generating
embeddings. The first is to replace the notion of cosine distance between point
vectors with the inner product E between two Gaussian densities:

$$ E(\mathcal{N}_i, \mathcal{N}_j) = \int_{x \in \mathbb{R}^n} \mathcal{N}(x; \mu_i, \Sigma_i)\, \mathcal{N}(x; \mu_j, \Sigma_j)\, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j) \tag{5.80} $$

where $\mathcal{N}(x; \mu_i, \Sigma_i)$ and $\mathcal{N}(x; \mu_j, \Sigma_j)$ are the
densities of a target and context word. This is a symmetric measure that is
computationally efficient, but it cannot model asymmetric relationships between
words. The second, more expressive method is to model similarity through the notion
of KL-divergence and train to optimize a loss function:

$$ D_{KL}(\mathcal{N}_j \,\|\, \mathcal{N}_i) = \int_{x \in \mathbb{R}^n} \mathcal{N}(x; \mu_i, \Sigma_i) \log \frac{\mathcal{N}(x; \mu_j, \Sigma_j)}{\mathcal{N}(x; \mu_i, \Sigma_i)}\, dx \tag{5.81} $$

This KL-divergence method enables Gaussian embeddings to incorporate the notion
of entailment, as low KL-divergence from w to c implies that c entails w. Furthermore,
as KL-divergence is asymmetric, these embeddings can encode asymmetric
similarity in the word types.

Word2Gauss has been shown to perform significantly better at asymmetric tasks 
such as entailment [VM14]. Still, unimodal Gaussian densities do not adequately 
deal with polysemous words, and computational complexity during training is an 
important consideration. 
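
The asymmetry of KL-divergence between Gaussian embeddings can be checked numerically. The sketch below uses the standard closed form for Gaussians with diagonal covariances, with the conventional definition KL(N0 || N1) as the expectation under N0 of log(N0/N1); the function name and the toy variances are assumptions for illustration.

```python
import numpy as np


def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL(N0 || N1) for Gaussians with diagonal covariances given as variance vectors."""
    return 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1 - 1.0 + np.log(var1 / var0))


# Toy densities: a broad (general) word and a narrow (specific) word at the same mean.
mu = np.zeros(2)
broad, narrow = np.full(2, 4.0), np.full(2, 0.25)

kl_narrow_broad = kl_diag_gauss(mu, narrow, mu, broad)
kl_broad_narrow = kl_diag_gauss(mu, broad, mu, narrow)
print(kl_narrow_broad, kl_broad_narrow)  # the two directions differ
```

The narrow density sits comfortably inside the broad one, so the divergence in that direction is small, while the reverse direction is heavily penalized. This is the property that lets Gaussian embeddings encode entailment-like relations.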


5.4.6.2 Bayesian Skip-Gram 


A recent approach that builds upon the notion of word embeddings as probability
densities takes a generative Bayesian approach. The Bayesian skip-gram model
(BSG) models each word representation in the form of a Bayesian model generated 
from prior densities associated with each occurrence of a given word in the corpus. 
By incorporating context, the BSG model can overcome the polysemy limitation 
of Word2Gauss. In fact, it can potentially model an infinite set of continuous word 
senses. 

For a target word w and a set of context words c, the BSG model assumes a prior
distribution $p_\theta(z \mid w)$ and a posterior distribution $q_\theta(z \mid c, w)$,
each in the form of a Gaussian distribution:

$$ p_\theta(z \mid w) = \mathcal{N}(z \mid \mu_w, \Sigma_w) \tag{5.82} $$

$$ q_\theta(z \mid c, w) = \mathcal{N}(z \mid \mu_q, \Sigma_q) \tag{5.83} $$

with diagonal covariance matrices $\Sigma_w$ and $\Sigma_q$, where z are latent vectors
drawn from the prior (Fig. 5.22). The larger the values of the covariance matrix
$\Sigma_q$, the more uncertain the meaning of target word w in context c.

The BSG model aims to maximize the probability $p_\theta(c \mid w)$ of words in a
context window given a target word w, which is analogous to the skip-gram model. It is
trained by taking a target word w, sampling its latent meaning z from the prior, and
drawing context words c from $p_\theta(c \mid z)$. The goal is to maximize the
log-likelihood function:

$$ \log p_\theta(c \mid w) = \log \int \prod_{j=1}^{C} p_\theta(c_j \mid z)\, p_\theta(z \mid w)\, dz \tag{5.84} $$

where C is the context window size and $c_j$ are the context words for target word w.
For computational simplicity, the BSG model is trained by optimizing the lower
bound of the log-likelihood, given by:

$$
J(\theta) = \sum_{(j,k)} \left( D_{KL}\left[q_\theta \,\|\, \mathcal{N}(z; \mu_{\tilde{c}_k}, \Sigma_{\tilde{c}_k})\right] - D_{KL}\left[q_\theta \,\|\, \mathcal{N}(z; \mu_{c_j}, \Sigma_{c_j})\right] \right)
- D_{KL}\left[q_\theta \,\|\, p_\theta(z \mid w)\right] \tag{5.85}
$$

where the sum is over pairs of positive $c_j$ and negative $\tilde{c}_k$ context words.
The result of training is a set of embeddings associated with the prior $p_\theta$ that
represent a word type, and embeddings associated with the posterior $q_\theta$ that
encode dynamic context. In comparison with Word2Gauss, BSG can provide better
context sensitivity, as in the case of polysemy.

Fig. 5.22: Context specific densities in a Bayesian skip-gram model


5.4.7 Hyperbolic Embeddings 


The ability of embedding models to model complex patterns is constrained by the
dimensionality of the embedding space. Furthermore, it has been shown that
embedding models do poorly in capturing latent hierarchical relationships. In an
effort to overcome these limitations, Nickel and Kiela [NK17] proposed Poincaré
embeddings as a method to effectively increase the representation capacity while
learning latent hierarchy. Their approach is based on learning representations in
hyperbolic space instead of Euclidean space. Hyperbolic geometry can effectively
and efficiently model hierarchical structures such as trees (Fig. 5.23). In fact, trees
can be thought of as instances of discrete hyperbolic spaces. It is notable that the
dimensionality needed to represent trees grows linearly in hyperbolic space but
quadratically in Euclidean space.

The Poincaré embedding model learns hierarchical representations by mapping
words to a d-dimensional unit ball $B^d = \{x \in \mathbb{R}^d \mid \|x\| < 1\}$, where
$\|x\|$ is the Euclidean norm. In hyperbolic space, the distance between two points
$u, v \in B^d$ is:

$$ d(u, v) = \cosh^{-1}\left(1 + 2\, \frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right) \tag{5.86} $$

Note that as $\|x\|$ approaches 1, the distance to other points grows exponentially. The
notion of straight lines in Euclidean space maps to geodesics in $B^d$ (Fig. 5.24). Such
a formulation allows the modeling of trees by placing the root node at or near the
origin and leaf nodes near the boundary. During training, the model learns
representations $\Theta$:

$$ \Theta = \{\theta_i\}_{i=1}^{n}, \quad \text{where } \theta_i \in B^d, \tag{5.87} $$

by minimizing the loss function $L(\Theta)$ using a negative sampling approach on a set
of data $D = \{(u, v)\}$:

$$ L(\Theta) = \sum_{(u,v) \in D} \log \frac{e^{-d(u,v)}}{\sum_{v' \in N(u)} e^{-d(u,v')}} \tag{5.88} $$

where $N(u) = \{v' \mid (u, v') \notin D\}$ is the set of negative samples.

Fig. 5.23: Embedding a tree within a hyperbolic space

Fig. 5.24: Whereas shortest paths are straight lines in Euclidean space, they are
curved lines within a hyperbolic space and are called geodesics

The formulation of Nickel and Kiela requires the use of stochastic Riemannian
optimization methods to induce embeddings. Methods such as Riemannian stochastic
gradient descent suffer from several limitations and require an extra projection step
to bring the embeddings back into the unit hyperball. Furthermore, they are
computationally expensive to train, which makes them less feasible for a large text
corpus. Recently, Dhingra et al. generalized hyperbolic embeddings by incorporating a
parametric approach based on learning encoder functions $f_\theta$ that map word
sequences to embeddings on the Poincaré ball $B^d$ [Dhi+18]. The method is predicated
on the notion that semantically general concepts occur in a wider range of contexts,
while semantically specific concepts occur in a narrower range. By using a simple
parameterization of the direction and norm of the hyperbolic embeddings and applying
a sigmoid function to the norm, this method allows embeddings to be induced using
popular optimization methods with only a modified distance metric and loss function.
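
The distance of Eq. (5.86) is straightforward to compute. The sketch below, with the hypothetical helper name `poincare_distance`, evaluates the distance from the origin to points at increasing radius in a 2-dimensional ball.

```python
import numpy as np


def poincare_distance(u, v):
    """Distance between two points inside the unit ball, following Eq. (5.86)."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / denom)


origin = np.zeros(2)
distances = []
for r in (0.5, 0.9, 0.99):
    x = np.array([r, 0.0])
    distances.append(poincare_distance(origin, x))
    print(r, distances[-1])
```

Equal Euclidean steps toward the boundary correspond to ever larger hyperbolic distances, which is what leaves room for the many leaf nodes of a tree near the edge of the ball.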


5.5 Applications 


Word embedding models have led to the improvement of state-of-the-art scores in
a wide range of NLP tasks. In many applications, traditional methods have been
almost completely replaced by word embedding approaches. Their ability to map
variable-length sequences to fixed-length representations has opened the door for the
application of deep learning to natural language processing. In the next sections, we
provide simple examples of how word embeddings can be applied to NLP, while
leaving deep learning approaches to later chapters.


5.5.1 Classification 


Text classification forms the basis of many important tasks in NLP. Traditional 
linear-classifier bag-of-word approaches such as Naive Bayes or logistic regression 
can perform well for text classification. However, they suffer from the inability to 
generalize to words and phrases unseen in the training data. Embedding models 
provide the ability to overcome this shortcoming. By leveraging pretrained em- 
beddings—learning word representations on a separate large corpus—we can build 
classifiers that generalize across text. 

The FastText model proposed by Joulin et al. [Jou+16a] is an example of a
text-classification model that leverages word embeddings (Fig. 5.25). The first phase
of FastText learns word representations on a large corpus, in effect capturing the
semantic relationships over a wide vocabulary. These embeddings are then used during
the classifier training phase, in which the words of a document are mapped to vectors
using these embeddings and the vectors are averaged to form a latent representation of
the document. These latent representations and their labels form the training set for
a softmax or hierarchical softmax classifier. FastText, as its name implies, is
computationally efficient and is reportedly able to train on a billion words in
10 minutes while achieving near state-of-the-art performance [Jou+16a].
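
The averaging step and the softmax layer of such a classifier can be sketched as follows. This is not the FastText library itself: the vocabulary, the embedding matrix `E`, the weights `W` and `b`, and the function names are hypothetical, and all values are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"good": 0, "great": 1, "bad": 2, "awful": 3, "movie": 4}
dim, n_classes = 6, 2

E = rng.normal(scale=0.1, size=(len(vocab), dim))   # placeholder "pretrained" embeddings
W = rng.normal(scale=0.1, size=(dim, n_classes))    # untrained softmax weights
b = np.zeros(n_classes)


def document_vector(text):
    """Average the embeddings of the in-vocabulary words of a document."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]
    return E[ids].mean(axis=0) if ids else np.zeros(dim)


def predict_proba(text):
    """Softmax class probabilities computed from the averaged document vector."""
    scores = document_vector(text) @ W + b
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


probs = predict_proba("great movie")
```

Training would fit `W` and `b` (and optionally `E`) on the labeled document vectors; the averaging keeps the cost per document linear in its length.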


5.5.2 Document Clustering 


Traditional document clustering based on bag-of-words often leads to excessively
high dimensionality and data sparsity. Topic modeling methods such as latent
semantic analysis (LSA) and latent Dirichlet allocation (LDA) can be applied to
document clustering, but they either ignore word co-occurrence or scale poorly
computationally.

We have seen how word embeddings can be used to create latent representa- 
tions of documents. These representations capture the semantic information within 
the documents, and it is fairly easy to perform k-means or another conventional 
clustering method to identify document clusters. Empirical evidence has shown the 
superiority of using embeddings to perform document clustering [LM14] over bag- 
of-words or topic model approaches. Use of pre-trained embeddings can enhance 
the semantic information available to cluster documents. 
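
Clustering document representations with k-means can be sketched with plain NumPy. The document vectors below are synthetic placeholders standing in for two topics, and the farthest-first initialisation is a choice made here to keep the example deterministic; it is not prescribed by the text.

```python
import numpy as np


def kmeans(X, k, n_iters=20):
    """Plain Lloyd's algorithm with farthest-first initialisation."""
    centroids = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(dists)])
    centroids = np.array(centroids)
    for _ in range(n_iters):
        # Assign each document to its nearest centroid.
        diffs = X[:, None, :] - centroids[None, :, :]
        labels = np.argmin(np.linalg.norm(diffs, axis=2), axis=1)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids


rng = np.random.default_rng(0)
# Placeholder document vectors: two synthetic groups standing in for two topics.
docs = np.vstack([rng.normal(0.0, 0.1, size=(10, 4)),
                  rng.normal(5.0, 0.1, size=(10, 4))])
labels, centroids = kmeans(docs, k=2)
```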
Fig. 5.25: FastText model


5.5.3 Language Modeling 


As noted previously, language models are strongly related to the training of
embedding models, given that both predict a target word given a set of context
words. An n-gram language model predicts a word $w_t$ given the previous words
$w_{t-1}, w_{t-2}, \ldots, w_{t-n+1}$. Training of an n-gram language model is
equivalent to maximizing the log-likelihood:

$$ J(\theta) = \sum_{t=1}^{T} \log P(w_t \mid w_{t-1}, w_{t-2}, \ldots, w_{t-n+1}) \tag{5.89} $$

In comparison, training of the CBOW word2vec model is equivalent to maximizing
the objective function:

$$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}) \tag{5.90} $$

So the language model predicts a target word based on the previous n - 1 words,
while CBOW predicts a target word based on the n context words on each side.
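
The difference between the two conditioning contexts in Eqs. (5.89) and (5.90) can be made concrete with a small sketch. The helper names are hypothetical; the n-gram context follows Eq. (5.89) in using the n - 1 preceding words.

```python
def ngram_context(tokens, t, n):
    """Context of an n-gram language model: the n - 1 words before position t."""
    return tokens[max(t - n + 1, 0):t]


def cbow_context(tokens, t, n):
    """Context of CBOW: up to n words on each side of position t."""
    return tokens[max(t - n, 0):t] + tokens[t + 1:t + n + 1]


tokens = "the quick brown fox jumps over the lazy dog".split()
t = 4  # target word: "jumps"
print(ngram_context(tokens, t, n=3))  # ['brown', 'fox']
print(cbow_context(tokens, t, n=2))   # ['brown', 'fox', 'over', 'the']
```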
Embedding methods excel at language modeling tasks and have led to deep
neural network approaches leading the state of the art in performance [MH09].


5.5.4 Text Anomaly Detection


Anomaly detection plays an important part in many applications. Unfortunately,
anomaly detection on text is generally difficult due to data sparsity and the
extremely high dimensionality of text. Existing methods for structured data fall into
distance-based methods, density-based methods, and subspace methods [Kan+17].
However, these methods do not generalize easily to unstructured text data. While
matrix factorization and topic modeling approaches can bridge this gap, they can still
suffer from high dimensionality and noise, as many words tend to be topically
irrelevant to the context of a document. Embedding models can map text sequences into
dense representations that permit the application of distance-based and density-based
methods [Che+16]. An example of an embedding approach is illustrated in Fig. 5.26. At
training time, the model learns text representations and clusters entities via k-means,
such that clusters and dense regions within the latent entity space are identified. At
prediction time, a document can be mapped to its latent representation $v_d$. A
distance-based approach could calculate the distance of this representation $v_d$ to
the cluster centroids $c_j$ identified at training time and flag the document if the
distance to the nearest centroid exceeds a threshold T (Fig. 5.27):

$$ \min_{c_j} \|v_d - c_j\| > T \implies \text{anomaly} \tag{5.91} $$

A density-based approach could count the number of entities within a small
neighborhood of $v_d$ and flag the document if the count falls below a threshold T.

Fig. 5.26: Embedding-based outlier detection
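
Both decision rules can be sketched in a few lines. The centroids and training vectors below are hand-picked placeholders standing in for the output of the training phase, and the function names, `radius`, and `min_count` are hypothetical parameters introduced for this example.

```python
import numpy as np


def is_anomaly(v_d, centroids, threshold):
    """Distance rule of Eq. (5.91): nearest centroid farther than `threshold`."""
    nearest = np.min(np.linalg.norm(centroids - v_d, axis=1))
    return bool(nearest > threshold)


def is_sparse(v_d, train_vectors, radius, min_count):
    """Density rule: fewer than `min_count` training vectors within `radius`."""
    count = np.sum(np.linalg.norm(train_vectors - v_d, axis=1) <= radius)
    return bool(count < min_count)


centroids = np.array([[0.0, 0.0], [5.0, 5.0]])   # assumed k-means output
train_vectors = np.array([[0.1, 0.0], [0.0, 0.2], [5.1, 4.9], [4.8, 5.0]])

typical = np.array([0.2, 0.1])
unusual = np.array([2.5, 9.0])
print(is_anomaly(typical, centroids, threshold=1.0))   # False
print(is_anomaly(unusual, centroids, threshold=1.0))   # True
print(is_sparse(unusual, train_vectors, radius=1.0, min_count=1))  # True
```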
Fig. 5.27: Outlier detection model


5.5.5 Contextualized Embeddings 


In the past year, a number of new methods leveraging contextualized embeddings 
have been proposed. These are based on the notion that embeddings for words 
should be based on contexts in which they are used. This context can be the po- 
sition and presence of surrounding words in the sentence, paragraph, or document. 
By generatively pre-training contextualized embeddings and language models on 
massive amounts of data, it became possible to discriminatively fine-tune models 
on a variety of tasks and achieve state-of-the-art results. This has been commonly 
referred to as “NLP’s ImageNet moment” [HR18]. 

One of the notable methods is the Transformer model, an attention-based
stacked encoder-decoder architecture (see Chap. 7) that is pre-trained at scale.
Vaswani et al. [Vas+17a] applied this model to the task of machine translation and
broke performance records.

Another important method is ELMo, short for Embeddings from Language Mod- 
els, which generates a set of contextualized word representations that effectively 
capture syntax and semantics as well as polysemy. These representations are ac- 
tually the internal states of a bidirectional, character-based LSTM language model 
that is pre-trained on a large external corpus (see Chap. 10). 

Building on the power of Transformers, a method called BERT, short for
Bidirectional Encoder Representations from Transformers, has recently been proposed.
BERT is a transformer-based masked language model that is bidirectionally trained to
generate deep contextualized word embeddings that capture both left-to-right and
right-to-left contexts. These embeddings require very little fine-tuning to excel
at complex downstream tasks such as entailment or question answering [Dev+18].
BERT has broken multiple performance records and represents one of the major
breakthroughs in language representations today.


5.6 Case Study


We start with a detailed look into the word2vec algorithm and examine a Python
implementation of the skip-gram model with negative sampling. Once the concepts
underpinning word2vec are examined, we will use the Gensim package to speed
up training and investigate the translational properties of word embeddings.
We will examine GloVe embeddings as an alternative to word2vec. Both methods,
however, are unable to handle antonymy, polysemy, and word-sense disambiguation.
We consider document clustering by using an embeddings approach. Lastly,
we study how an embedding method like sense2vec can better handle word-sense
disambiguation.


5.6.1 Software Tools and Libraries 


In this case study, we will be examining the inner operations of word2vec's
skip-gram and negative sampling approach as well as GloVe embeddings with Python.
We will also leverage the popular nltk, gensim, glove, and spaCy libraries for our
analysis. NLTK is a popular open-source toolkit for natural language processing
and text analytics. The gensim library is an open-source toolkit for vector space
modeling and topic modeling implemented in Python with Cython performance
acceleration. The glove library is an efficient open-source implementation of GloVe in
Python. SpaCy is a fast open-source NLP library written in Python and Cython for
part-of-speech tagging and named entity recognition.

For our analysis, we will leverage the Open American National Corpus, which 
consists of roughly 15 million spoken and written words from a variety of sources. 
Specifically, we will be using the subcorpus which consists of 4531 Slate magazine 
articles from 1996 to 2000 (approximately 4.2 million words). 


5.6.2 Exploratory Data Analysis 


Let's take a look at some basic statistics on this dataset, such as document length
and sentence length (Figs. 5.28 and 5.29). Examining the frequencies of the top 1000
terms in this corpus (Fig. 5.30), we see that the top 100 terms are what we typically
consider stop-words (Table 5.1). They are common across most sentences and do not
capture much, if any, semantic meaning. As we move further down the list, we start to
see words that play a more important role in conveying the meaning within a sentence
or document.

Fig. 5.28: Document length

Fig. 5.29: Sentence length


5.6.3 Learning Word Embeddings 


Our goal is to train a set of word embeddings for the corpus above. Let's build a
skip-gram model with negative sampling, followed by a GloVe model. Before we
train either model, we see that there are 77,440 unique words in the preprocessed
4.86 million word corpus.

Rank   Word       Frequency
0      the        266,007
1      of         115,973
2      -          114,156
3      to         107,951
4      a          100,993
5      and         96,375
6      in          74,561
7      that        64,448
8      is          51,590
9      it          38,175
...
990    Eyes           500
991    Troops         499
992    Raise          499
993    Pundits        499
994    Calling        498
995    de             498
996    Sports         498
997    Strategy       497
998    Numbers        496
999    Argues         496

Fig. 5.30: Word frequency histogram


5.6.3.1 Word2Vec


We are now ready to train the neural network of the word2vec model. Let’s define 
our model parameters: 


dim = dimension of the word vectors 

win = context window size (number of tokens) 

start_alpha = starting learning rate 

neg = number of samples for negative sampling 

min_count = minimum mentions to be included in vocabulary 


We can reduce the size of this vocabulary by filtering out rare words. If we apply a 
minimum count threshold of 5 mentions in the corpus, we find that our vocabulary 
size drops down to 31,599, such that 45,842 words will be considered OOV. We will 
be mapping all of these words to a special out-of-vocabulary token. 


    # Start the truncated vocabulary with the out-of-vocabulary token at index 0.
    truncated = []
    truncated.append(VocabWord('<unk>'))
    unk_hash = 0

    # Fold the counts of rare words into <unk>; keep all other words.
    count_unk = 0
    for token in vocab_items:
        if token.count < min_count:
            count_unk += 1
            truncated[unk_hash].count += token.count
        else:
            truncated.append(token)

    # Sort by descending frequency and rebuild the word-to-index lookup.
    truncated.sort(key=lambda token: token.count, reverse=True)

    vocab_hash = {}
    for i, token in enumerate(truncated):
        vocab_hash[token.word] = i

    vocab_items = truncated
    vocab_size = len(vocab_items)
    print('Unknown vocab size:', count_unk)
    print('Truncated vocab size: %d' % vocab_size)





5.6.3.2 Negative Sampling 


To speed up training, let’s create a negative sampling lookup table that we will use 
during training. 





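import math
import numpy as np

power = 0.75
# Normalizing constant for the smoothed unigram distribution.
norm = sum([math.pow(t.count, power) for t in vocab_items])

table_size = int(1e8)  # length of the unigram lookup table
# np.int has been removed from recent NumPy releases; np.int64 is used instead.
table = np.zeros(table_size, dtype=np.int64)

p = 0  # cumulative probability
i = 0
for j, unigram in enumerate(vocab_items):
    p += float(math.pow(unigram.count, power)) / norm
    # Fill the slice of the table that corresponds to word j.
    while i < table_size and float(i) / table_size < p:
        table[i] = j
        i += 1

def sample(table, count):
    # Draw `count` word indices uniformly from the lookup table.
    indices = np.random.randint(low=0, high=len(table), size=count)
    return [table[i] for i in indices]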
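The effect of raising counts to the power 0.75 is easiest to see on a small example. The following is a minimal sketch, assuming a made-up three-word vocabulary (`toy_counts` and `unigram_probs` are hypothetical names, not part of the case study code). It compares the raw unigram distribution with the smoothed one.

```python
# Toy illustration of the smoothed unigram distribution used for negative sampling.
# The counts below are invented for the example, not taken from the corpus.
toy_counts = {"the": 1000, "market": 100, "pundits": 10}

def unigram_probs(counts, power=1.0):
    """Return P(w) proportional to count(w) ** power."""
    weights = {w: c ** power for w, c in counts.items()}
    norm = sum(weights.values())
    return {w: v / norm for w, v in weights.items()}

raw = unigram_probs(toy_counts, power=1.0)
smoothed = unigram_probs(toy_counts, power=0.75)

for word in toy_counts:
    print("%-8s raw=%.4f smoothed=%.4f" % (word, raw[word], smoothed[word]))
```

Smoothing moves probability mass from very frequent words to rare ones, so rare words are drawn as negative samples more often than their raw frequency would suggest.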





5.6.3.3 Training the Model 


We are now ready to train the word2vec model. The approach is to train a two-layer
(syn0, syn1) neural network by iterating over the sentences in the corpus and
adjusting layer weights to maximize the probabilities of context words given a
target word (skip-gram) with negative sampling. After completion, the weights of
the hidden layer syn0 are the word embeddings that we seek.





# Input (syn0) and output (syn1) weight matrices.
tmp = np.random.uniform(low=-0.5/dim, high=0.5/dim, size=(vocab_size, dim))
syn0 = np.ctypeslib.as_ctypes(tmp)
syn0 = np.array(syn0)

tmp = np.zeros(shape=(vocab_size, dim))
syn1 = np.ctypeslib.as_ctypes(tmp)
syn1 = np.array(syn1)

alpha = start_alpha   # learning rate, kept constant here
word_count = 0
current_sent = 0
truncated_vocabulary = [x.word for x in vocab_items]
corpus = df['text'].tolist()

# sigmoid(z) = 1 / (1 + exp(-z)) is assumed to be defined elsewhere.
while current_sent < len(corpus):
    line = corpus[current_sent]
    # Map tokens to vocabulary indices; unknown tokens map to <unk>.
    sent = [vocab_hash.get(token, vocab_hash['<unk>'])
            for token in ['<bol>'] + line.split() + ['<eol>']]

    for sent_pos, token in enumerate(sent):
        # Randomly shrink the window, then collect the context words.
        current_win = np.random.randint(low=1, high=win+1)
        context_start = max(sent_pos - current_win, 0)
        context_end = min(sent_pos + current_win + 1, len(sent))
        context = sent[context_start:sent_pos] + sent[sent_pos+1:context_end]

        for context_word in context:
            embed = np.zeros(dim)
            # One positive example plus `neg` negative samples.
            classifiers = [(token, 1)] + [(target, 0) for target in sample(table, neg)]
            for target, label in classifiers:
                z = np.dot(syn0[context_word], syn1[target])
                p = sigmoid(z)
                g = alpha * (label - p)
                embed += g * syn1[target]
                syn1[target] += g * syn0[context_word]
            syn0[context_word] += embed

        word_count += 1
    current_sent += 1
    if current_sent % 10000 == 0:
        print("\rReading sentence %d" % current_sent)

embedding = dict(zip(truncated_vocabulary, syn0))
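The inner loop above applies the same update to every (input vector, output vector, label) triple. The sketch below isolates that single step on toy vectors, in plain Python and without the corpus. The names `sgns_step` and `score` and the numeric values are illustrative assumptions, not part of the original listing.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgns_step(v_in, v_out, label, alpha):
    """One gradient step for a single (input vector, output vector, label) triple.

    label is 1 for an observed pair and 0 for a negative sample.
    Both updates use the values the vectors had before the step.
    """
    z = sum(a * b for a, b in zip(v_in, v_out))
    g = alpha * (label - sigmoid(z))
    new_in = [a + g * b for a, b in zip(v_in, v_out)]
    new_out = [b + g * a for a, b in zip(v_in, v_out)]
    return new_in, new_out

def score(v_in, v_out):
    """Predicted probability that the pair is an observed one."""
    return sigmoid(sum(a * b for a, b in zip(v_in, v_out)))

# Toy vectors, invented for the example.
v_in, v_out = [0.1, -0.2, 0.05], [0.3, 0.1, -0.1]
before = score(v_in, v_out)
pos_in, pos_out = sgns_step(v_in, v_out, label=1, alpha=0.5)
neg_in, neg_out = sgns_step(v_in, v_out, label=0, alpha=0.5)
print(before, score(pos_in, pos_out), score(neg_in, neg_out))
```

A positive label pulls the two vectors together and raises the predicted probability, while a negative sample pushes them apart and lowers it.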





The semantic translation properties of these embeddings are noteworthy. Let’s examine
the cosine distance between two similar words (man, woman) and two dissimilar words
(candy, social). We would expect the similar words to exhibit higher similarity,
that is, a smaller distance.


• dist(man, woman) = 0.01258108
• dist(candy, social) = 0.05319491
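The function used to compute these values is not shown. A minimal sketch, assuming the reported distance is one minus the cosine similarity (the function names and the toy vectors are hypothetical), could look as follows.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(u, v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.1]    # nearly parallel to a
c = [-2.0, 1.0, 3.0]   # orthogonal to a

print(cosine_distance(a, b), cosine_distance(a, c))
```

With the embeddings trained above, one would pass `embedding['man']` and `embedding['woman']` in place of the toy vectors.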


5.6.3.4 Visualize Embeddings 


We can visualize the word embeddings using the T-SNE algorithm to map the em- 
beddings to 2D space. Note that T-SNE is a dimensionality reduction technique that 
preserves notions of proximity within a vector space (points close together in 2D are 
close in proximity in higher dimensions). The figure below shows the relationships 
of a 300-word sample from the vocabulary (Fig. 5.31). 


5.6.3.5 Using the Gensim package 


The python code above is useful for understanding principles, but is not the fastest
to run. The original word2vec package was written in C to facilitate rapid training
speed over multiple cores. The gensim package provides an API to the word2vec library,
as well as several useful methods to examine vector neighborhoods. Let’s see how we
can use gensim to train on the sample data corpus. Gensim expects us to provide a set
of documents as a list of lists of tokens. We will call the simple_preprocess() method
of gensim to lowercase the text and remove punctuation and special characters. With
the wrapper API provided by the gensim package, training word2vec is as simple as
defining a model and passing the set of training documents.

Fig. 5.31: word2vec embeddings visualized using T-SNE


import gensim

documents = [gensim.utils.simple_preprocess(df['text'].iloc[i])
             for i in range(len(df))]
# Note: gensim 4.0 and later name the `size` argument `vector_size`.
model = gensim.models.Word2Vec(documents,
                               size=100,
                               window=10,
                               min_count=2,
                               workers=10)
model.train(documents, total_examples=len(documents), epochs=10)





5.6.3.6 Similarity 


Let’s assess the quality of the learned word embeddings by examining word
neighborhoods. If we look at the most similar words to “man” or “book,” we find highly
similar words in their neighborhoods. So far so good.


model.wv.most_similar("man", topn=5)
[('guy', 0.6880463361740112),
 ('woman', 0.6301935315132141),
 ('person', 0.6296881437301636),
 ('soldier', 0.5808842182159424),
 ('someone', 0.5552011728286743)]

model.wv.most_similar("book", topn=5)
[('books', 0.7232613563537598),
 ('novel', 0.6448987126350403),
 ('biography', 0.6039375066757202),
 ('memoir', 0.6010321378707886),
 ('chapter', 0.5646576881408691)]





Let’s look at some polysemous words. The similar words to the word “bass” reflect 
the music definition of bass. That is, they only capture a single word sense (there 
are no words related to the aquatic definition of bass). Similarly, words similar to 
“bank” all reflect its financial word sense, but no seashores or riverbeds. This is one 
of the major shortcomings of word2vec. 





model.wv.most_similar("bass", topn=5)
[('guitar', 0.6996911764144897),
 ('solo', 0.6786242723464966),
 ('blazer', 0.6665750741958618),
 ('roars', 0.6658747792243958),
 ('corduroy', 0.6525936126708984)]

model.wv.most_similar("bank", topn=5)
[('banks', 0.6580432653427124),
 ('bankers', 0.5862468481063843),
 ('imf', 0.5782995223999023),
 ('reserves', 0.5546875),
 ('loans', 0.5457302331924438)]





We can examine the semantic translation properties in more detail with some vec- 
tor algebra. If we start with the word “son” and subtract “man” and add “woman,” 
we indeed find that “daughter” is the closest word to the resulting sum. Similarly, 
if we invert the operation and start with the word “daughter” and subtract “woman” 
and add “man,” we find that “son” is closest to the sum. Note that reciprocity is not 
guaranteed with word2vec. 


model.wv.similar_by_vector(model.wv['son']
                           - model.wv['man']
                           + model.wv['woman'], topn=5)
[('daughter', 0.7489624619483948),
 ('sister', 0.7321654558181763),
 ('mother', 0.7243343591690063),
 ('boyfriend', 0.7229076623916626),
 ('lover', 0.7120637893676758)]

model.wv.similar_by_vector(model.wv['daughter']
                           - model.wv['woman']
                           + model.wv['man'], topn=5)
[('son', 0.7144862413406372),
 ('daughter', 0.6668421030044556),
 ('man', 0.6652499437332153),
 ('grandfather', 0.5896619558334351),
 ('father', 0.585667073726654)]


We can also see that word2vec captures geographic similarities by taking the word
“paris,” subtracting “france” and adding “russia.” The resulting sum is close to
what we expect: after the query word “russia” itself, the nearest word is “moscow.”


model.wv.similar_by_vector(model.wv['paris']
                           - model.wv['france']
                           + model.wv['russia'], topn=5)
[('russia', 0.7788714170455933),
 ('moscow', 0.6269053220748901),
 ('brazil', 0.6154285669326782),
 ('japan', 0.592476487159729),
 ('gazeta', 0.5799405574798584)]





We have previously discussed that word embeddings generated by word2vec are
unable to distinguish antonyms, as these words often share the same context words
in normal usage and consequently have learned embeddings close to each other.
For instance, the most similar word to “large” is “small,” and the most similar word
to “hard” is “easy.” Antonymy is hard!


model.wv.most_similar("large", topn=5)
[('small', 0.726446270942688),
 ('enormous', 0.5439934134483337),
 ('huge', 0.5070887207984924),
 ('vast', 0.5017688870429993),
 ('size', 0.48968151211738586)]

model.wv.most_similar("hard", topn=5)
[('easy', 0.6564798355102539),
 ('difficult', 0.6085934638977051),
 ('tempting', 0.5201482772827148),
 ('impossible', 0.5099537372589111),
 ('easier', 0.4868208169937134)]





5.6.3.7 GloVe Embeddings 


Whereas word2vec captures the local context of words within sentences, GloVe 
embeddings can additionally account for global context across the corpus. Let’s 
take a deeper dive on how to calculate GloVe embeddings. We begin by building a 
vocabulary dictionary from the corpus. 





from collections import Counter

# Count every token, then map word -> (index, frequency).
vocab_count = Counter()
for line in corpus:
    tokens = line.strip().split()
    vocab_count.update(tokens)
vocab = {word: (i, freq) for i, (word, freq) in enumerate(vocab_count.items())}





5.6.3.8 Co-occurrence Matrix 


Let’s build the word co-occurrence matrix from the corpus. Note that word occur- 
rences go both ways, from the main word to context, and vice versa. For smaller 
values of the context window, this matrix is expected to be sparse. 


# Build co-occurrence matrix
from scipy import sparse

min_count = None  # value missing in the source; None disables the frequency filter
window_size = 5

vocab_size = len(vocab)
id2word = dict((i, word) for word, (i, _) in vocab.items())

occurrence = sparse.lil_matrix((vocab_size, vocab_size), dtype=np.float64)

for i, line in enumerate(corpus):
    tokens = line.split()
    token_ids = [vocab[word][0] for word in tokens]

    for center_i, center_id in enumerate(token_ids):
        # Words to the left of the center word, at most window_size of them.
        context_ids = token_ids[max(0, center_i - window_size):center_i]
        contexts_len = len(context_ids)

        for left_i, left_id in enumerate(context_ids):
            # Closer words contribute more: the weight is 1 / distance.
            distance = contexts_len - left_i
            increment = 1.0 / float(distance)
            occurrence[center_id, left_id] += increment
            occurrence[left_id, center_id] += increment
    if i % 10000 == 0:
        print("Processing sentence %d" % i)

def occur_matrix(vocab, coccurrence, min_count):
    # Yield (row, column, value) for every stored entry, skipping rare words.
    for i, (row, data) in enumerate(zip(coccurrence.rows, coccurrence.data)):
        if min_count is not None and vocab[id2word[i]][1] < min_count:
            continue
        for data_idx, j in enumerate(row):
            if min_count is not None and vocab[id2word[j]][1] < min_count:
                continue
            yield i, j, data[data_idx]
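To make the distance weighting concrete, here is a small dictionary-based sketch of the same counting scheme on two toy sentences. It assumes whitespace tokenization and uses a plain `defaultdict` instead of a sparse matrix; the function name `cooccurrence_counts` and the sentences are invented for the example.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window_size=2):
    """Symmetric co-occurrence counts, each pair weighted by 1 / distance."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for center_i, center in enumerate(tokens):
            start = max(0, center_i - window_size)
            for left_i in range(start, center_i):
                increment = 1.0 / (center_i - left_i)
                counts[(center, tokens[left_i])] += increment
                counts[(tokens[left_i], center)] += increment
    return counts

toy = cooccurrence_counts(["the young man sat", "the dead man"], window_size=2)
for pair in [("man", "young"), ("man", "the"), ("sat", "young")]:
    print(pair, toy[pair])
```

Adjacent words add 1.0 to their entry and words two positions apart add 0.5, so "man" and "the", which are two positions apart in both sentences, accumulate 1.0 in total.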
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5.6.3.9 GloVe Training 


We can now train the embeddings by iterating repeatedly over the nonzero entries of
the co-occurrence matrix built from the documents (sentences) in the corpus.


from random import shuffle
from math import log
import pickle

iterations = 30
dim = 100
learning_rate = 0.05
x_max = 100
alpha = 0.75

vocab_size = len(vocab)
# Rows [0, vocab_size) hold main-word vectors, the remaining rows context vectors.
W = (np.random.rand(vocab_size * 2, dim) - 0.5) / float(dim + 1)
biases = (np.random.rand(vocab_size * 2) - 0.5) / float(dim + 1)
# Running sums of squared gradients for the AdaGrad-style updates below.
gradient_squared = np.ones((vocab_size * 2, dim), dtype=np.float64)
gradient_squared_biases = np.ones(vocab_size * 2, dtype=np.float64)

# comatrix is not defined in the listing; it is assumed to hold the
# (i_main, i_context, value) triples yielded by occur_matrix above.
data = [(W[i_main], W[i_context + vocab_size],
         biases[i_main : i_main + 1],
         biases[i_context + vocab_size : i_context + vocab_size + 1],
         gradient_squared[i_main], gradient_squared[i_context + vocab_size],
         gradient_squared_biases[i_main : i_main + 1],
         gradient_squared_biases[i_context + vocab_size : i_context + vocab_size + 1],
         cooccurrence)
        for i_main, i_context, cooccurrence in comatrix]

for i in range(iterations):
    global_cost = 0
    shuffle(data)
    for (v_main, v_context, b_main, b_context, gradsq_W_main, gradsq_W_context,
         gradsq_b_main, gradsq_b_context, cooccurrence) in data:

        # Down-weight rare pairs and cap the weight of frequent ones at 1.
        weight = (cooccurrence / x_max) ** alpha if cooccurrence < x_max else 1

        cost_inner = (v_main.dot(v_context)
                      + b_main[0] + b_context[0]
                      - log(cooccurrence))
        cost = weight * (cost_inner ** 2)
        global_cost += 0.5 * cost

        grad_main = weight * cost_inner * v_context
        grad_context = weight * cost_inner * v_main
        grad_bias_main = weight * cost_inner
        grad_bias_context = weight * cost_inner

        # In-place updates: the tuples hold views into W and biases.
        v_main -= (learning_rate * grad_main / np.sqrt(gradsq_W_main))
        v_context -= (learning_rate * grad_context / np.sqrt(gradsq_W_context))

        b_main -= (learning_rate * grad_bias_main / np.sqrt(gradsq_b_main))
        b_context -= (learning_rate * grad_bias_context / np.sqrt(gradsq_b_context))

        gradsq_W_main += np.square(grad_main)
        gradsq_W_context += np.square(grad_context)
        gradsq_b_main += grad_bias_main ** 2
        gradsq_b_context += grad_bias_context ** 2
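The quantity minimized by the loop is a weighted squared error between the model score and the log co-occurrence. A minimal sketch of the weighting function and the per-pair cost, written in plain Python with hypothetical function names and toy inputs, is given below. The training loop above additionally multiplies this cost by 0.5 before accumulating it.

```python
import math

def glove_weight(x, x_max=100, alpha=0.75):
    """Weighting f(x) that damps rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_cost(v_main, v_context, b_main, b_context, x,
                    x_max=100, alpha=0.75):
    """Weighted squared error for a single co-occurrence entry x > 0."""
    inner = (sum(a * b for a, b in zip(v_main, v_context))
             + b_main + b_context - math.log(x))
    return glove_weight(x, x_max, alpha) * inner ** 2

# A pair whose dot product already equals log(x) has zero cost.
perfect = glove_pair_cost([1.0, 0.0], [math.log(10.0), 0.0], 0.0, 0.0, 10.0)
# The same vectors are penalized when the observed count is different.
mismatch = glove_pair_cost([1.0, 0.0], [math.log(10.0), 0.0], 0.0, 0.0, 50.0)
print(perfect, mismatch)
```

Because the weight saturates at `x_max`, very frequent pairs such as those involving stop words cannot dominate the objective.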





The learned weight matrix consists of two sets of vectors, one for the word in the
main word position and one for the word in the context position. We will average them
to generate the final GloVe embeddings for each word.


def merge_vectors(W, merge_fun=lambda m, c: np.mean([m, c], axis=0)):
    vocab_size = int(len(W) / 2)
    for i, row in enumerate(W[:vocab_size]):
        # Combine the main and context vectors, then normalize to unit length.
        merged = merge_fun(row, W[i + vocab_size])
        merged /= np.linalg.norm(merged)
        W[i, :] = merged

    return W[:vocab_size]

embedding = merge_vectors(W)


5.6.3.10 GloVe Vector Similarity 


Let’s examine the translational properties of these vectors. We define a simple
function that returns the 5 most similar words to the word “man.”


most_similar(embedding, vocab, id2word, 'man', 5)
('woman', 0.9718018808969603)
('girl', 0.9262655177669397)
('single', 0.9222400016708986)
('dead', 0.9187203648559261)
('young', 0.9081009733127359)
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Interestingly, the similarity results fall into two categories. Whereas “woman” and 
“girl” have similar semantic meaning to “man,” the words “dead” and “young” do
not. But these words do co-occur often, with phrases such as “young man” or “dead
man.” GloVe embeddings can capture both contexts. We can see this when we visualize
the embeddings using T-SNE (Fig. 5.32).
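The most_similar helper called above is not listed in the text. One possible implementation, matching the call signature used above but otherwise an assumption, ranks all other words by cosine similarity. The toy vocabulary and vectors are invented for the example.

```python
import math

def most_similar(embedding, vocab, id2word, word, n=5):
    """Return the n words with the highest cosine similarity to `word`.

    embedding: sequence of vectors indexed by word id
    vocab:     word -> (id, frequency)
    id2word:   id -> word
    """
    def unit(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    query_id = vocab[word][0]
    query = unit(embedding[query_id])
    scores = []
    for idx, row in enumerate(embedding):
        if idx == query_id:
            continue  # skip the query word itself
        sim = sum(a * b for a, b in zip(query, unit(row)))
        scores.append((id2word[idx], sim))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Toy data, invented for the example.
toy_vocab = {"man": (0, 10), "woman": (1, 8), "apple": (2, 5)}
toy_id2word = {i: w for w, (i, _) in toy_vocab.items()}
toy_embedding = [[1.0, 0.2], [0.9, 0.3], [-0.1, 1.0]]
neighbours = most_similar(toy_embedding, toy_vocab, toy_id2word, "man", 2)
print(neighbours)
```

Since merge_vectors already normalizes each row to unit length, a NumPy version could replace the loop with a single matrix-vector product.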
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Fig. 5.32: GloVe embeddings visualized using T-SNE 


5.6.3.11 Using the Glove Package 


While useful, our python implementation is too slow to run with a large corpus. The 
glove library is a python package that implements the GloVe algorithm efficiently. 
Let’s retrain our embeddings using the glove package. 


from glove import Corpus, Glove

# Build the co-occurrence matrix, then fit the embeddings.
corpus = Corpus()
corpus.fit(documents, window=5)

glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

Let’s assess the quality of these embeddings by examining a few words.


glove.most_similar('man', number=6)
[('woman', 0.9417155142176431),
 ('young', 0.8541752252243202),
 ('guy', 0.8138920634188781),
 ('person', 0.8044470112897205),
 ('girl', 0.793038798219135)]

glove.most_similar('nice', number=6)
[('guy', 0.7583150809899194),
 ('very', 0.7071106359169386),
 ('seems', 0.7048211092737807),
 ('terrible', 0.697033427158236),
 ('fun', 0.6898111303194308)]

glove.most_similar('apple', number=6)
[('industry', 0.6965166116455955),
 ('employee', 0.6724064797672178),
 ('fbi', 0.6280345651329606),
 ('gambling', 0.6276268857034702),
 ('indian', 0.6266591982382662)]





Once again, the most similar words exhibit both semantic similarity and high co- 
occurrence probability. Even with the additional context, GloVe embeddings still 
lack the ability to handle antonyms and word sense disambiguation. 


5.6.4 Document Clustering 


The use of word embeddings provides a useful and efficient means for document 
clustering in comparison with traditional approaches such as LSA or LDA. The 
simplest approach is a bag-of-words method where a document vector is created by 
averaging the vectors of each of the words in the document. Let’s take our Slate 
corpus and see what we can find with this approach. 


5.6.4.1 Document Vectors 


We create a set of document vectors by adding the vectors of each word in the 
document and dividing by the total number of words. 





documents = [gensim.utils.simple_preprocess(ndf['text'].iloc[i])
             for i in range(len(ndf))]
corpus = Corpus()
corpus.fit(documents, window=5)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
print("Glove embeddings trained.")

doc_vectors = []
for doc in documents:
    # Sum the word vectors of the document, then divide by its length.
    vec = np.zeros((dim,))
    for token in doc:
        vec += glove.word_vectors[glove.dictionary[token]]
    if len(doc) > 0:
        vec = vec / len(doc)
    doc_vectors.append(vec)

print("Processed documents = ", len(doc_vectors))
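The averaging step can be checked on a toy example that does not need a trained model. The sketch below uses a plain dictionary of made-up two-dimensional vectors; `document_vector` and `toy_vectors` are hypothetical names. Unlike the listing above, it skips tokens that have no vector and divides by the number of known tokens, which is one possible design choice when a document may contain out-of-vocabulary words.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the in-vocabulary tokens of one document."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary token: skipped in this sketch
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    return [t / known for t in total] if known else total

# Toy vectors, invented for the example.
toy_vectors = {"stocks": [1.0, 0.0], "rally": [0.0, 1.0]}
doc = document_vector(["stocks", "rally", "unseen"], toy_vectors, dim=2)
empty = document_vector([], toy_vectors, dim=2)
print(doc, empty)
```

An empty document maps to the zero vector, which mirrors the `len(doc) > 0` guard in the listing above.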





If we visualize these embeddings using T-SNE, we can see there are several pro- 
nounced clusters (Fig. 5.33). 







Fig. 5.33: Document vectors visualized using T-SNE 


5.6.5 Word Sense Disambiguation 


Word sense disambiguation is an important task in computational linguistics. How- 
ever, word2vec or GloVe embeddings map words to a single embedding vector, 
and therefore lack the ability to disambiguate between multiple senses of words. 
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The sense2vec algorithm is an improved approach that can deal with polysemy or 
antonymy through supervised disambiguation. Moreover, sense2vec is computation- 
ally inexpensive and can be implemented as a preprocessing task prior to training a 
word2vec or GloVe model. To see this, let’s apply the sense2vec algorithm to our 
corpus by leveraging the spaCy library to generate part-of-speech labels that will 
serve as our supervised disambiguation labels. 


5.6.5.1 Supervised Disambiguation Annotations 


Let’s process the sentences in our corpus using the spaCy NLP annotations. We
create a separate corpus where each word is augmented by its part-of-speech label.
For instance, the word “he” is mapped to “he_PRON.”


import spacy

# spaCy 3 and later require a full model name such as 'en_core_web_sm'
# in place of the 'en' shortcut.
nlp = spacy.load('en', disable=['parser', 'ner'])
corpus = df['text'].tolist()
print("Number of docs = ", len(corpus))

docs = []
count = 0
for item in corpus:
    docs.append(nlp(item))
    count += 1
    if count % 10000 == 0:
        print("Processed document #", count)

# Append the part-of-speech tag to every token, e.g. "he" -> "he_PRON".
sense_corpus = [[x.text + "_" + x.pos_ for x in y] for y in docs]








5.6.5.2 Training with word2vec 


With the new preprocessed corpus, we can proceed with training word2vec. We can 
use this trained model to look at how words like “run” or “lie” can be disambiguated 
based on their part-of-speech. 


model.wv.most_similar("run_NOUN", topn=5)
[('runs_NOUN', 0.5418172478675842),
 ('term_NOUN', 0.5085563063621521),
 ('ropy_VERB', 0.5027114152908325),
 ('distance_NOUN', 0.49787676334381104),
 ('sosa_NOUN', 0.4942496120929718)]

model.wv.most_similar("run_VERB", topn=5)
[('put_VERB', 0.6089274883270264),
 ('work_VERB', 0.599068284034729),
 ('hold_VERB', 0.5984195470809937),
 ('break_VERB', 0.5887631177902222),
 ('get_VERB', 0.5873323082923889)]

model.wv.most_similar("lie_NOUN", topn=5)
[('truth_NOUN', 0.6057517528533936),
 ('guilt_NOUN', 0.5678446888923645),
 ('sin_NOUN', 0.565475344657898),
 ('perjury_NOUN', 0.5402902364730835),
 ('madness_NOUN', 0.5183135867118835)]

model.wv.most_similar("lie_VERB", topn=5)
[('talk_VERB', 0.662897527217865),
 ('expose_VERB', 0.64887535572052),
 ('testify_VERB', 0.6263021230697632),
 ('commit_VERB', 0.6155776381492615),
 ('leave_VERB', 0.5946056842803955)]





5.6.6 Exercises for Readers and Practitioners 


Word embedding algorithms can be extended in a number of interesting ways, and 
the reader is encouraged to investigate: 


1. Training embeddings based on character n-grams, byte-pairs, or other subword 
approaches. 

2. Applying an embeddings approach to cluster named entities. 

3. Using embeddings as input features for a classifier. 


In subsequent chapters, the reader will realize that embeddings are fundamental to 
the application of neural networks to text and speech. Furthermore, embeddings 
enable transfer learning and are an important consideration in any deep learning 
algorithm.


Chapter 6
Convolutional Neural Networks


6.1 Introduction 


In the last few years, convolutional neural networks (CNNs), along with recurrent
neural networks (RNNs), have become a basic building block in constructing complex
deep learning solutions for various NLP, speech, and time series tasks. LeCun
first introduced certain basic parts of the CNN frameworks as a general NN framework
to solve various high-dimensional data problems in computer vision, speech,
and time series [LB95]. CNN-based models later applied convolutions to recognize
objects in the images of the ImageNet challenge; by improving substantially on the
state of the art, they revived interest in deep learning and CNNs. Collobert et al.
pioneered the application of CNNs to NLP tasks, such as POS tagging, chunking, named
entity resolution, and semantic role labeling [CW08b]. Many aspects of CNNs, from
input representation, number of layers, types of pooling, and optimization techniques
to applications in various NLP tasks, have been active subjects of research in the
last decade.

The initial sections of this chapter describe CNNs, starting with the basic op- 
erations, and demonstrate how these networks address the reduction in parameters 
while creating an inductive bias towards local patterns. Later sections derive the for- 
ward and backward pass equations for the basic CNN. Applications of CNNs and 
their adaptations to text inputs are introduced next. Classic CNN frameworks are 
then presented, followed by modern frameworks, in order to provide readers with 
examples of the diversity of ways in which CNNs are used in different domains. 
Special attention is paid to popular applications of CNNs to various NLP tasks. 
This chapter also describes specific algorithms that allow deep CNN frameworks 
to run more efficiently on modern GPU-based hardware. To provide readers with 
a practical, hands-on experience, the chapter concludes with a detailed case study
of airline tweet sentiment analysis that applies many of the discussed CNN frameworks,
implemented with Keras and TensorFlow. In this case study, readers are provided
with a detailed exploratory data analysis, preprocessing, training, validation,
and evaluation, similar to what one can experience in a real-world project.


6.2 Basic Building Blocks of CNN


The next few sections introduce fundamental concepts and building blocks of CNNs. 
Note that since CNNs originated in computer vision applications, many of the terms 
and examples in their building blocks refer to images or two-dimensional (2d) ma- 
trices. As the chapter continues, these will be mapped to one-dimensional (1d) text 
input data. 


6.2.1 Convolution and Correlation in Linear Time-Invariant 
Systems 


6.2.1.1 Linear Time-Invariant Systems 


In signal processing or time series analysis, a transformation or a system that is
linear and time-invariant is called a linear time-invariant (LTI) system. Time
invariance means that if y(t) = T(x(t)), then y(t - s) = T(x(t - s)), where x(t) and
y(t) are the inputs and the outputs, while T() is the transformation.

A linear system possesses the following two properties:


1. Scaling: T(a x(t)) = a T(x(t))
2. Superposition: T(x_1(t) + x_2(t)) = T(x_1(t)) + T(x_2(t))


6.2.1.2 The Convolution Operator and Its Properties 

Convolution is a mathematical operation performed on LTI systems in which an input
function x(t) is combined with a function h(t) to give a new output that signifies
an overlap between x(t) and the reverse translated version of h(t). The function h(t)
is generally known as a kernel or filter transformation. In the continuous domain,
this can be defined as:

$$y(t) = (h * x)(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau \tag{6.1}$$

In the discrete domain, in one dimension, this can be defined as:

$$y(i) = (h * x)(i) = \sum_{n} h(n)\, x(i - n) \tag{6.2}$$

Similarly in two dimensions, mostly used in computer vision with still images:

$$y(i, j) = (h * x)(i, j) = \sum_{n} \sum_{m} h(m, n)\, x(i - m, j - n) \tag{6.3}$$

This can be also written as cross-correlation with a flipped or rotated kernel:

$$y(i, j) = (h * x)(i, j) = \sum_{n} \sum_{m} x(i + m, j + n)\, h(-m, -n) \tag{6.4}$$

$$y(i, j) = (h * x)(i, j) = \sum_{n} \sum_{m} x(i + m, j + n)\, \mathrm{rotate}_{180}\{h(m, n)\} \tag{6.5}$$


Convolution exhibits the general commutative, distributive, associative, and dif- 
ferentiable properties. 
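The discrete definition in Eq. (6.2) translates directly into a few lines of code. The sketch below is an illustration in plain Python (the function name `convolve1d` and the signals are made up for the example). It computes the full convolution of two finite sequences and checks the commutative and scaling properties numerically.

```python
def convolve1d(h, x):
    """Full discrete convolution: y(i) = sum_n h(n) * x(i - n)."""
    y = [0.0] * (len(h) + len(x) - 1)
    for n, h_n in enumerate(h):
        for k, x_k in enumerate(x):
            # The term h(n) * x(k) contributes to output index i = n + k.
            y[n + k] += h_n * x_k
    return y

x = [1.0, 2.0, 3.0, 4.0]
h = [1.0, 0.0, -1.0]

y = convolve1d(h, x)
y_swapped = convolve1d(x, h)                        # commutativity
y_scaled = convolve1d(h, [2.0 * v for v in x])      # scaling of the input
print(y, y_swapped, y_scaled)
```

Swapping the arguments gives the same output, and doubling the input doubles the output, as expected for a linear operation.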


6.2.1.3 Cross-Correlation and Its Properties 


Cross-correlation is a mathematical operation very similar to convolution and is a
measure of similarity or of the strength of the correlation between two signals x(t)
and h(t). It is given by:

$$y(t) = (h \otimes x)(t) = \int_{-\infty}^{\infty} h(\tau)\, x(t + \tau)\, d\tau \tag{6.6}$$

In the discrete domain, in one dimension, this can be defined as:

$$y(i) = (h \otimes x)(i) = \sum_{n} h(n)\, x(i + n) \tag{6.7}$$

Similarly in two dimensions:

$$y(i, j) = (h \otimes x)(i, j) = \sum_{n} \sum_{m} h(m, n)\, x(i + m, j + n) \tag{6.8}$$

It is important to note that cross-correlation is very similar to convolution but does
not exhibit commutative and associative properties.
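The relationship between the two operations can be verified on a short sequence. The following sketch, in plain Python with hypothetical function names, computes the cross-correlation of Eq. (6.7) over the positions where the kernel fits entirely inside the input, and obtains the convolution over the same positions by flipping the kernel first.

```python
def cross_correlate1d(h, x):
    """Cross-correlation y(i) = sum_n h(n) * x(i + n), kernel fully inside x."""
    out_len = len(x) - len(h) + 1
    return [sum(h[n] * x[i + n] for n in range(len(h))) for i in range(out_len)]

def convolve1d_valid(h, x):
    """Convolution over the same positions: cross-correlation with the flipped kernel."""
    return cross_correlate1d(h[::-1], x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
h = [1.0, 2.0, 0.0]          # asymmetric kernel: the two results differ
h_sym = [1.0, 2.0, 1.0]      # symmetric kernel: the two results coincide

corr = cross_correlate1d(h, x)
conv = convolve1d_valid(h, x)
print(corr, conv)
```

For a symmetric kernel, flipping changes nothing, which is one reason the two terms are often used interchangeably in the CNN literature.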


Many CNNs employ the cross-correlation operator, but the operation is called 
convolution. We will use the terms synonymously, as the main idea of both is 
to capture similarity in input signals. Many of the terms used in CNNs have 
their roots in image processing. 


In a regular NN, the transformation between two subsequent layers involves 
multiplication by the weight matrix. By contrast, in CNNs, the transformation 
instead involves the convolution operation. 


6.2.2 Local Connectivity or Sparse Interactions 


In a basic NN, all the units of the input layer connect to all the units of the next
layer. This connectivity negatively impacts both computational efficiency and the
ability to capture certain local interactions. Consider the example shown in Fig. 6.1,
where the input layer has m = 9 dimensions, and the hidden layer has n = 4 dimensions;
in a fully connected NN, as shown in Fig. 6.1a, there will be m x n = 36 connections
(thus, weights) that the NN has to learn. On the other hand, if we allow only
k = 3 spatially proximal inputs to connect to a single unit of the hidden layer, as
shown in Fig. 6.1b, the number of connections reduces to n x k = 12. Another advantage
of limiting the connectivity is that restricting the hidden layer connections
to spatially proximal inputs forces the feed-forward system to learn local features
through backpropagation. We will refer to the weight matrix of dimension k as our
filter, or kernel. The spatial extent of the connectivity or the size of the filter
(width and height) is generally called the receptive field, due to its computer vision
heritage. In a 3d input space, the depth of the filter is always equal to the depth of
the input, but the width and height are hyperparameters that can be obtained via search.


6.2.3 Parameter Sharing 


Parameter sharing or tied weights is a concept in which the same parameters
(weights) are reused across all the connections between two layers. Parameter sharing
helps reduce the parameter space and hence, memory usage. The locally connected layer
in Fig. 6.2a learns n x k = 12 parameters; if the local connections instead share the
same set of k weights, as shown in Fig. 6.2b, memory usage is reduced by a factor of n.
Note that the feed-forward computation still requires n x k operations. Parameter
sharing can also result in the transformation property called equivariance, in which
the mapping preserves the structure. A function f() is equivariant to a function g(),
if f(g(x)) = g(f(x)) for input x.
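The parameter counts and the equivariance property can both be checked numerically. The sketch below is an illustration in plain Python: it uses the values m = 9, n = 4, and k = 3 from the figures, while the filter weights, the input signal, and the helper names are invented for the example. The shift is implemented as a one-position move to the right with zero fill.

```python
m, n, k = 9, 4, 3   # inputs, hidden units, inputs seen by each hidden unit

fully_connected = m * n      # every input connects to every hidden unit
locally_connected = n * k    # each hidden unit sees only k inputs
shared = k                   # one set of k weights reused by all hidden units

def slide_filter(w, x, stride=1):
    """Apply the same weights w at every position of x (shared parameters)."""
    return [sum(w[j] * x[i + j] for j in range(len(w)))
            for i in range(0, len(x) - len(w) + 1, stride)]

def shift_right(x):
    """Translate the signal by one position, filling with zero on the left."""
    return [0.0] + x[:-1]

w = [1.0, -1.0, 2.0]
x = [0.0, 3.0, 1.0, 4.0, 1.0, 5.0, 0.0, 0.0, 0.0]

shift_then_filter = slide_filter(w, shift_right(x))
filter_then_shift = slide_filter(w, x)
print(fully_connected, locally_connected, shared)
print(shift_then_filter[1:], filter_then_shift[:-1])
```

Away from the boundary, shifting the input and then filtering gives the same values as filtering and then shifting the output, which is the equivariance to translation that shared weights provide.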


The combination of local connectivity and parameter sharing results in filters 
that capture common features acting as building blocks across all inputs. In 
image processing, these common features learned through filters can be basic 
edge detection or higher-level shapes. In NLP or text mining, these common 
features can be combinations of n-grams that capture associations of words or 
characters that are over-represented in the training corpus. 


6.2.4 Spatial Arrangement 


Note that the number, arrangement, and connections of the filters are hyperparameters
in a model. The depth of the CNN is the number of layers, whereas the number
of filters in a layer determines the “depth” of the subsequent layer. How the filters
get moved, that is, how many inputs get skipped before the next instance of the filter,
is known as the stride of the filter. For example, when the stride is 2, two inputs are
skipped before the next instance of the filter is connected to the subsequent layer.
The inputs can be zero-padded at the edges, so that filters can fit around the edge
units. The number of padding units is another hyperparameter for tuning. Figure 6.3
illustrates different paddings between layers. Adding paddings to the edges with
values of 0 is called zero-padding; the convolution performed with zero-padding is
called wide convolution, and one without is called narrow convolution.

Fig. 6.1: Local connectivity and sparse interactions. (a) Fully connected layers. (b)

Fig. 6.2: Local connectivity and parameter sharing. (a) Locally connected layers.
(b) Locally connected layers with parameter sharing

Fig. 6.3: Spatial arrangement and relationship between hyperparameters. (a) Spatial
arrangement leading to N = 4 neurons in the next layer. (b) Spatial
arrangement leading to N = 5 neurons in the next layer

The relationship between the number of inputs (input volume) W, receptive field
(filter size) F, stride size S, and the zero-padding P leads to the number of neurons
in the next layer, as provided in Eq. (6.9).

$$N = \frac{W - F + 2P}{S} + 1 \tag{6.9}$$
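Equation (6.9) is easy to turn into a small helper. The sketch below is a plain Python illustration (`conv_output_size` is a hypothetical name). The first two calls use assumed values of W, F, S, and P that yield 4 and 5 neurons; the last two use the sizes stated in the captions of Figs. 6.4 and 6.5.

```python
def conv_output_size(W, F, S=1, P=0):
    """Number of neurons N = (W - F + 2P) / S + 1 along one dimension.

    Raises ValueError when the filter does not tile the padded input exactly.
    """
    span = W - F + 2 * P
    if span < 0 or span % S != 0:
        raise ValueError("filter size, stride, and padding do not fit the input")
    return span // S + 1

print(conv_output_size(9, 3, S=2, P=0))   # assumed setting giving 4 neurons
print(conv_output_size(9, 3, S=2, P=1))   # assumed setting giving 5 neurons
print(conv_output_size(3, 2, S=1, P=0))   # 3 x 3 input, 2 x 2 filter (Fig. 6.4)
print(conv_output_size(6, 3, S=1, P=0))   # 6 x 6 input, 3 x 3 filter (Fig. 6.5)
```

Raising an error for inconsistent settings is only one way to handle the constraint; libraries may instead truncate or pad silently.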
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Figure 6.4 illustrates how a 2d filter convolves with the input generating the final 
output for that layer, with the convolution steps broken down. Figure 6.4 visual- 
izes all of the above spatial arrangements and illustrates how changing the padding 
affects the number of neurons according to the equation above. 


[Figure 6.4 panels: Input, Kernel, Output. Each of the four output entries is the sum
of the elementwise products of the 2 x 2 kernel with one 2 x 2 patch of the input; for
example, step (1) gives x11 w11 + x12 w12 + x21 w21 + x22 w22.]

Fig. 6.4: A 2d filter of size 2 x 2 on a 2d input of size 3 x 3 gives a convolved output
of 2 x 2 with a stride of 1 and no zero-padding

Figure 6.5 extends the convolution process to a 3d input with three channels 
(similar to RGB channels in an image) and two filters, showing how the linear con- 
volution reduces the volume between layers. As shown in Fig. 6.6, these filters act 
as a complex feature detection mechanism. Given a large volume of training data, a 
CNN can learn filters, such as horizontal edge, vertical edge, outline, and more in 
the classification process. 


The number of channels in the filters should match the number of channels in 
its input. 


Most practical toolboxes or libraries that implement CNNs throw exceptions 
or safely handle the relationships when hyperparameters violate the con- 
straints given by Eq. (6.9). 
Fig. 6.5: An illustration of an image with dimensions of 6 × 6 × 3
(height × width × channels) convolving with two filters each of size 3 × 3 × 3, no
padding and stride 1, resulting in the output of 4 × 4 × 2. The two filters can be
thought of as working in parallel, resulting in two outputs on the right of the diagram

Fig. 6.6: Four filters capturing basic image features


6.2.5 Detector Using Nonlinearity 


The output of a convolution is an affine transformation that feeds into a nonlinear
layer or transformation known as a detector layer/stage. This is very similar to the
activation function studied in Chap. 4, where the affine transformation of weights
and inputs passes through a nonlinear transformation function. The detector layer
normally uses the sigmoid $f(x) = \frac{1}{1 + e^{-x}}$, hyperbolic tangent
$f(x) = \tanh(x)$, or ReLU $f(x) = \max(0, x)$ as the nonlinear function. As
discussed in Chap. 4, ReLU is the most popular function because of its easy
computation and simple differentiation. ReLU has also been shown to lead to better
generalization when used in CNNs.


6.2.6 Pooling and Subsampling 


The outputs of one layer can be further downsampled to capture summary statistics 
of local neurons or sub-regions. This process is called pooling, subsampling or 
downsampling based on the context. Usually, the outputs from the detector stage 
become the inputs for the pooling layer. 

Pooling has many useful effects, such as reduction of overfitting and reduction of 
the number of parameters. Specific types of pooling can also result in invariance. 
Invariance allows identification of a feature irrespective of its precise location and 
is an essential property in classification. For example, in a face detection problem, 
the presence of features that indicate an eye is not only essential but it also exists 
irrespective of its location in an image. 

There are different pooling methods, each with its own benefits, and the particular 
choice depends on the task at hand. When the bias of the pooling method matches 
the assumptions made in a particular CNN application, such as in the example of 
face detection, one can expect significant improvement in the results. Some of the 
more popular pooling methods are listed below. 


6.2.6.1 Max Pooling 


As the name suggests, the max pooling operation chooses the maximum value of
neurons from its inputs and thus contributes to the invariance property discussed
above. This is illustrated in Fig. 6.7. Formally, for a 2d output from the detection
stage, a max pooling layer performs the transformation:

$$h^{l}_{i,j} = \max_{p,q} \, h^{l-1}_{i+p,\,j+q} \tag{6.10}$$

where $p$ and $q$ denote the coordinates of the neuron in its local neighborhood and $l$
represents the layer. In k-max pooling, k values are returned instead of a single value
in the max pooling operation.


6.2.6.2 Average Pooling


In average or mean pooling, the local neighborhood neuron values are averaged to
give the output value, as illustrated in Fig. 6.7. Formally, average pooling performs
the transformation:

$$h^{l}_{i,j} = \frac{1}{m^{2}} \sum_{p,q} h^{l-1}_{i+p,\,j+q} \tag{6.11}$$

where $m \times m$ is the dimension of the kernel.


6.2.6.3 L2-Norm Pooling 


L2-norm pooling is a generalization of the average pooling and is given by:

$$h^{l}_{i,j} = \sqrt{\sum_{p,q} \left(h^{l-1}_{i+p,\,j+q}\right)^{2}} \tag{6.12}$$

There are indeed many variants of pooling, such as k-max pooling, dynamic pooling,
dynamic k-max pooling, and others.



Fig. 6.7: Examples of pooling operations. (a) Max pooling. (b) Average pooling 
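
The three pooling operations of Eqs. (6.10) to (6.12) can be sketched in NumPy as follows. This is an illustration under assumptions, not a reference implementation: the function name `pool2d` is hypothetical, and the windows are taken to be non-overlapping m x m blocks (stride equal to the window size).

```python
import numpy as np


def pool2d(h_prev: np.ndarray, m: int, mode: str = "max") -> np.ndarray:
    """Pool a 2d detector output over non-overlapping m x m blocks."""
    rows, cols = h_prev.shape[0] // m, h_prev.shape[1] // m
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = h_prev[i * m:(i + 1) * m, j * m:(j + 1) * m]
            if mode == "max":        # Eq. (6.10)
                out[i, j] = block.max()
            elif mode == "avg":      # Eq. (6.11)
                out[i, j] = block.mean()
            elif mode == "l2":       # Eq. (6.12)
                out[i, j] = np.sqrt(np.sum(block ** 2))
            else:
                raise ValueError(f"unknown pooling mode: {mode}")
    return out


x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, 2, "avg"))   # [[ 2.5  4.5] [10.5 12.5]]
```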


6.2.6.4 Stochastic Pooling 


In stochastic pooling, instead of picking the maximum, the picked neuron is drawn
from a multinomial distribution formed from the activations in the pooling region.
Stochastic pooling works similarly to dropout in addressing the issue of overfitting
[ZF13b].


6.2.6.5 Spectral Pooling 


In spectral pooling, the spatial input is transformed into the frequency domain through
a discrete Fourier transform (DFT) to capture important signals in the lower dimension.
For example, if the input is $x \in \mathbb{R}^{m \times m}$ and has to be reduced to
the size $h \times w$, a DFT operation is performed on the input, only the central
$h \times w$ submatrix of the frequency representation is kept, and an inverse DFT is
then performed to transform back into the spatial domain [RSA15]. This transformation
has the effect of dimensionality reduction on the space and can be very effective in
certain applications.
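
The DFT, crop, and inverse-DFT steps can be sketched with NumPy's FFT routines. This is a simplified illustration and not the procedure of [RSA15] in full: the function name `spectral_pool` is hypothetical, the rescaling factor is an assumption chosen so that a constant input keeps its value, and taking the real part of the result stands in for the more careful treatment of conjugate symmetry that a complete implementation would need.

```python
import numpy as np


def spectral_pool(x: np.ndarray, h: int, w: int) -> np.ndarray:
    """Reduce a 2d input to h x w by keeping the central (low) frequencies."""
    rows, cols = x.shape
    # Move the zero-frequency component to the centre of the spectrum.
    freq = np.fft.fftshift(np.fft.fft2(x))
    top, left = rows // 2 - h // 2, cols // 2 - w // 2
    cropped = freq[top:top + h, left:left + w]
    # Undo the shift and return to the spatial domain.
    out = np.fft.ifft2(np.fft.ifftshift(cropped))
    # Assumption: rescale so that the mean intensity is preserved.
    return out.real * (h * w) / (rows * cols)


image = np.random.default_rng(0).normal(size=(8, 8))
pooled = spectral_pool(image, 4, 4)
print(pooled.shape)   # (4, 4)
```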


6.3 Forward and Backpropagation in CNN 


Now that all the basic components are covered, we connect them. This section goes
through the forward and the backward pass in a CNN step by step to make the different
operations involved clear.

In the interest of clarity, we will consider a basic CNN block consisting of one
convolutional layer, a nonlinear activation function, such as ReLU, that performs a
non-linear transformation, and a pooling layer. In real-world applications, multiple
blocks like this are stacked together to form network layers. The output of these
blocks is then flattened out and connected to a fully connected output layer. The
flattening out process converts a multidimensional tensor to a one-dimensional
vector, for example a three-dimensional $(W, H, N)$ tensor to a vector of dimension
$d = W \times H \times N$.

Let us start the derivation for the layer $l$, which is the output of convolution on
layer $l-1$. Layer $l$ has height $h$, width $w$, and channels $c$. Let us assume that
there is one channel, i.e., $c = 1$, and that $i, j$ are the iterators for the input
dimensions. We will consider a filter of dimensions $k_1 \times k_2$ with iterators
$m, n$ for the convolution operation. The weight matrix with weights $W^{l}_{m,n}$ and
the bias $b^{l}$ transforms the previous layer $l-1$ into the layer $l$ via the
convolution operation. The convolution layer is followed by a non-linear activation
function $f(\cdot)$, such as ReLU. The output for layer $l$ is denoted by $O^{l}_{i,j}$.

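Thus forward propagation for layer $l$ can be written as:

$$X^{l}_{i,j} = \mathrm{rotate}_{180^\circ}\{W^{l}_{m,n}\} * O^{l-1}_{i,j} + b^{l} \tag{6.13}$$

This can be expanded as:

$$X^{l}_{i,j} = \sum_{m} \sum_{n} W^{l}_{m,n} \, O^{l-1}_{i+m,\,j+n} + b^{l} \tag{6.14}$$

and

$$O^{l}_{i,j} = f(X^{l}_{i,j}) \tag{6.15}$$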
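
The forward pass of Eqs. (6.14) and (6.15) can be written out directly for a single channel. The sketch below is illustrative only; the function names `conv_forward` and `relu` are hypothetical, and the explicit loops favor readability over speed.

```python
import numpy as np


def conv_forward(O_prev: np.ndarray, W: np.ndarray, b: float) -> np.ndarray:
    """X[i, j] = sum_m sum_n W[m, n] * O_prev[i + m, j + n] + b, Eq. (6.14)."""
    h, w = O_prev.shape
    k1, k2 = W.shape
    X = np.empty((h - k1 + 1, w - k2 + 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            X[i, j] = np.sum(W * O_prev[i:i + k1, j:j + k2]) + b
    return X


def relu(X: np.ndarray) -> np.ndarray:
    """Detector stage, Eq. (6.15) with f = ReLU."""
    return np.maximum(0.0, X)


O_prev = np.arange(9, dtype=float).reshape(3, 3)   # 3 x 3 input
W = np.array([[1.0, 0.0], [0.0, 1.0]])             # 2 x 2 filter
X = conv_forward(O_prev, W, b=0.0)
O = relu(X)
print(X)   # [[ 4.  6.] [10. 12.]]
```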


We assume that a loss function $E$, such as the mean squared error, is used to measure
the difference between the predictions and the actual labels. The errors have to be
propagated back to update the weights of the filter and to reach the inputs received
at the layer $l$. Thus, in the backpropagation process, we are interested in the
gradient of the error $E$ with respect to (w.r.t.) the input,
$\partial E / \partial X$, and the filter weights, $\partial E / \partial W$.

6.3.1 Gradient with Respect to Weights $\partial E / \partial W$

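We will first consider the impact of a single pixel $(m', n')$ of the kernel, given by
$W^{l}_{m',n'}$, on the error $E$, using the chain rule:

$$\frac{\partial E}{\partial W^{l}_{m',n'}} = \sum_{i=0}^{h-k_1} \sum_{j=0}^{w-k_2} \frac{\partial E}{\partial X^{l}_{i,j}} \, \frac{\partial X^{l}_{i,j}}{\partial W^{l}_{m',n'}} \tag{6.16}$$

If we consider $\delta^{l}_{i,j} = \partial E / \partial X^{l}_{i,j}$ as the gradient
error for the layer $l$ w.r.t. its input, the above equation can be rewritten as:

$$\frac{\partial E}{\partial W^{l}_{m',n'}} = \sum_{i=0}^{h-k_1} \sum_{j=0}^{w-k_2} \delta^{l}_{i,j} \, \frac{\partial X^{l}_{i,j}}{\partial W^{l}_{m',n'}} \tag{6.17}$$

Writing the change in output in terms of the inputs (the previous layer) we get:

$$\frac{\partial X^{l}_{i,j}}{\partial W^{l}_{m',n'}} = \frac{\partial}{\partial W^{l}_{m',n'}} \left( \sum_{m} \sum_{n} W^{l}_{m,n} \, O^{l-1}_{i+m,\,j+n} + b^{l} \right) \tag{6.18}$$

When we take partial derivatives with respect to $W^{l}_{m',n'}$, all values become zero
except for the components mapping to $m = m'$ and $n = n'$:

$$\frac{\partial X^{l}_{i,j}}{\partial W^{l}_{m',n'}} = \frac{\partial}{\partial W^{l}_{m',n'}} \left( W^{l}_{0,0} \, O^{l-1}_{i+0,\,j+0} + \cdots + W^{l}_{m',n'} \, O^{l-1}_{i+m',\,j+n'} + \cdots + b^{l} \right) \tag{6.19}$$

$$\frac{\partial X^{l}_{i,j}}{\partial W^{l}_{m',n'}} = \frac{\partial}{\partial W^{l}_{m',n'}} \left( W^{l}_{m',n'} \, O^{l-1}_{i+m',\,j+n'} \right) \tag{6.20}$$

$$\frac{\partial X^{l}_{i,j}}{\partial W^{l}_{m',n'}} = O^{l-1}_{i+m',\,j+n'} \tag{6.21}$$

Substituting the result back, one gets:

$$\frac{\partial E}{\partial W^{l}_{m',n'}} = \sum_{i=0}^{h-k_1} \sum_{j=0}^{w-k_2} \delta^{l}_{i,j} \, O^{l-1}_{i+m',\,j+n'} \tag{6.22}$$

The whole process can be summarized as the convolution of the gradients
$\delta^{l}_{i,j}$ of layer $l$, rotated by 180 degrees, with the outputs of layer
$l-1$, i.e., $O^{l-1}_{m',n'}$. Thus, the new weights to be updated can be computed very
similarly to the forward pass.

$$\frac{\partial E}{\partial W^{l}_{m',n'}} = \mathrm{rotate}_{180^\circ}\{\delta^{l}_{i,j}\} * O^{l-1}_{m',n'} \tag{6.23}$$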
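
The weight gradient of Eq. (6.22) can be verified numerically. The sketch below assumes a single channel and a loss of the form E = sum(G * X), chosen only because it makes the gradient w.r.t. X equal to a known matrix G; the names `conv_forward`, `grad_weights`, and `loss` are hypothetical.

```python
import numpy as np


def conv_forward(O_prev, W, b):
    """Forward pass: X[i, j] = sum_m sum_n W[m, n] * O_prev[i + m, j + n] + b."""
    k1, k2 = W.shape
    rows, cols = O_prev.shape[0] - k1 + 1, O_prev.shape[1] - k2 + 1
    X = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            X[i, j] = np.sum(W * O_prev[i:i + k1, j:j + k2]) + b
    return X


def grad_weights(delta, O_prev, kernel_shape):
    """dE/dW[m', n'] = sum_i sum_j delta[i, j] * O_prev[i + m', j + n']."""
    k1, k2 = kernel_shape
    rows, cols = delta.shape
    dW = np.zeros((k1, k2))
    for m in range(k1):
        for n in range(k2):
            dW[m, n] = np.sum(delta * O_prev[m:m + rows, n:n + cols])
    return dW


rng = np.random.default_rng(0)
O_prev = rng.normal(size=(5, 5))
W = rng.normal(size=(3, 3))
G = rng.normal(size=(3, 3))          # plays the role of delta = dE/dX


def loss(W_):
    # Assumed toy loss: linear in X, so dE/dX is exactly G.
    return np.sum(G * conv_forward(O_prev, W_, 0.0))


analytic = grad_weights(G, O_prev, W.shape)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(W)
for m in range(W.shape[0]):
    for n in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[m, n] += eps
        W_minus[m, n] -= eps
        numeric[m, n] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))
```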


6.3.2 Gradient with Respect to the Inputs $\partial E / \partial X$


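Next, we are interested in how a change to a single input, given by $X^{l}_{i',j'}$,
affects the error $E$. Borrowing from computer vision, the input pixel $X^{l}_{i',j'}$
after convolution affects a region bounded by top left
$(i' - k_1 + 1, j' - k_2 + 1)$ and bottom right $(i', j')$. So, we obtain:

$$\frac{\partial E}{\partial X^{l}_{i',j'}} = \sum_{m=0}^{k_1-1} \sum_{n=0}^{k_2-1} \frac{\partial E}{\partial X^{l+1}_{i'-m,\,j'-n}} \, \frac{\partial X^{l+1}_{i'-m,\,j'-n}}{\partial X^{l}_{i',j'}} \tag{6.24}$$

$$\frac{\partial E}{\partial X^{l}_{i',j'}} = \sum_{m=0}^{k_1-1} \sum_{n=0}^{k_2-1} \delta^{l+1}_{i'-m,\,j'-n} \, \frac{\partial X^{l+1}_{i'-m,\,j'-n}}{\partial X^{l}_{i',j'}} \tag{6.25}$$

Expanding just the rate of change of the inputs, one obtains:

$$\frac{\partial X^{l+1}_{i'-m,\,j'-n}}{\partial X^{l}_{i',j'}} = \frac{\partial}{\partial X^{l}_{i',j'}} \left( \sum_{m'} \sum_{n'} W^{l+1}_{m',n'} \, O^{l}_{i'-m+m',\,j'-n+n'} + b^{l+1} \right) \tag{6.26}$$

Writing this in terms of input layer $l$, we get:

$$\frac{\partial X^{l+1}_{i'-m,\,j'-n}}{\partial X^{l}_{i',j'}} = \frac{\partial}{\partial X^{l}_{i',j'}} \left( \sum_{m'} \sum_{n'} W^{l+1}_{m',n'} \, f\!\left(X^{l}_{i'-m+m',\,j'-n+n'}\right) + b^{l+1} \right) \tag{6.27}$$

All partial derivatives result in zero except where $m' = m$ and $n' = n$, so that
$f(X^{l}_{i'-m+m',\,j'-n+n'}) = f(X^{l}_{i',j'})$ and $W^{l+1}_{m',n'} = W^{l+1}_{m,n}$
in the relevant output regions.

$$\frac{\partial X^{l+1}_{i'-m,\,j'-n}}{\partial X^{l}_{i',j'}} = \frac{\partial}{\partial X^{l}_{i',j'}} \left( W^{l+1}_{0,0} \, f\!\left(X^{l}_{i'-m+0,\,j'-n+0}\right) + \cdots + W^{l+1}_{m,n} \, f\!\left(X^{l}_{i',j'}\right) + \cdots + b^{l+1} \right) \tag{6.28}$$

$$\frac{\partial X^{l+1}_{i'-m,\,j'-n}}{\partial X^{l}_{i',j'}} = \frac{\partial}{\partial X^{l}_{i',j'}} \left( W^{l+1}_{m,n} \, f\!\left(X^{l}_{i',j'}\right) \right) \tag{6.29}$$

$$\frac{\partial X^{l+1}_{i'-m,\,j'-n}}{\partial X^{l}_{i',j'}} = W^{l+1}_{m,n} \, f'\!\left(X^{l}_{i',j'}\right) \tag{6.30}$$

Substituting, one obtains:

$$\frac{\partial E}{\partial X^{l}_{i',j'}} = \sum_{m=0}^{k_1-1} \sum_{n=0}^{k_2-1} \delta^{l+1}_{i'-m,\,j'-n} \, W^{l+1}_{m,n} \, f'\!\left(X^{l}_{i',j'}\right) \tag{6.31}$$

The term $\sum_{m=0}^{k_1-1} \sum_{n=0}^{k_2-1} \delta^{l+1}_{i'-m,\,j'-n} \, W^{l+1}_{m,n}$
can be seen as the flipped filter, or the filter rotated by 180 degrees, performing
convolution with the $\delta$ matrix. Thus,

$$\frac{\partial E}{\partial X^{l}_{i',j'}} = \delta^{l+1}_{i',j'} * \mathrm{rotate}_{180^\circ}\{W^{l+1}_{m,n}\} \, f'\!\left(X^{l}_{i',j'}\right) \tag{6.32}$$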
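
Eq. (6.31) amounts to a full convolution of the zero-padded $\delta^{l+1}$ with the flipped filter, scaled by the derivative of the activation. The following NumPy sketch checks this against finite differences; it assumes a single channel, ReLU as $f$, and a toy loss E = sum(G * X_next), and the names `grad_inputs` and `loss` are hypothetical.

```python
import numpy as np


def conv_forward(O_prev, W, b):
    """X[i, j] = sum_m sum_n W[m, n] * O_prev[i + m, j + n] + b."""
    k1, k2 = W.shape
    rows, cols = O_prev.shape[0] - k1 + 1, O_prev.shape[1] - k2 + 1
    X = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            X[i, j] = np.sum(W * O_prev[i:i + k1, j:j + k2]) + b
    return X


def grad_inputs(delta_next, W_next, X):
    """dE/dX[i', j'] = sum_m sum_n delta_next[i'-m, j'-n] * W_next[m, n] * f'(X[i', j'])."""
    k1, k2 = W_next.shape
    h, w = X.shape
    # Zero-pad delta so that out-of-range indices contribute nothing.
    padded = np.pad(delta_next, ((k1 - 1, k1 - 1), (k2 - 1, k2 - 1)))
    flipped = W_next[::-1, ::-1]          # filter rotated by 180 degrees
    dO = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            dO[i, j] = np.sum(flipped * padded[i:i + k1, j:j + k2])
    return dO * (X > 0)                   # f'(X) for ReLU


rng = np.random.default_rng(1)
X = rng.normal(size=(5, 5))               # pre-activation of layer l
W_next = rng.normal(size=(3, 3))          # filter of layer l + 1
G = rng.normal(size=(3, 3))               # plays the role of delta^{l+1}


def loss(X_):
    # Assumed toy loss: linear in X_next, so dE/dX_next is exactly G.
    return np.sum(G * conv_forward(np.maximum(0.0, X_), W_next, 0.0))


analytic = grad_inputs(G, W_next, X)

eps = 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        X_plus, X_minus = X.copy(), X.copy()
        X_plus[i, j] += eps
        X_minus[i, j] -= eps
        numeric[i, j] = (loss(X_plus) - loss(X_minus)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```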


6.3.3 Max Pooling Layer 


As we saw in the pooling section, the pooling layer does not have any weights but
just reduces the size of the input based on the spatial operations performed. In the
forward pass, the pooling operation converts each block of a matrix or a vector to a
single scalar value.

In max pooling, the winner neuron or cell is remembered; when backpropagation is
performed, the entire error is passed on to that winner neuron, as the others have not
contributed. In average pooling, an $n \times n$ pooling block during backpropagation
multiplies the incoming scalar error by $\frac{1}{n \times n}$ and distributes it
equally across the block.
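
Both backward rules can be sketched in a few lines of NumPy. The function names are hypothetical, and non-overlapping m x m pooling blocks are assumed.

```python
import numpy as np


def max_pool_backward(x: np.ndarray, grad_out: np.ndarray, m: int) -> np.ndarray:
    """Route each output gradient to the winner of its m x m block."""
    dx = np.zeros_like(x, dtype=float)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            block = x[i * m:(i + 1) * m, j * m:(j + 1) * m]
            p, q = np.unravel_index(np.argmax(block), block.shape)
            dx[i * m + p, j * m + q] = grad_out[i, j]
    return dx


def avg_pool_backward(grad_out: np.ndarray, m: int) -> np.ndarray:
    """Spread each output gradient equally over its m x m block."""
    return np.kron(grad_out, np.ones((m, m))) / (m * m)


x = np.arange(16, dtype=float).reshape(4, 4)
grad_out = np.array([[1.0, 2.0], [3.0, 4.0]])
print(max_pool_backward(x, grad_out, 2))
print(avg_pool_backward(grad_out, 2))
```

In both cases the gradients summed over the input equal the gradients summed over the output, since pooling has no weights of its own.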


6.4 Text Inputs and CNNs 


Various NLP and NLU tasks in text mining use a CNN as the feature engineering 
block. We will start with basic text classification to highlight some important analo- 
gies between images in computer vision and mappings in text and the necessary 
changes in the CNN process. 
6.4.1 Word Embeddings and CNN


Fig. 6.8: A simple text and CNN mapping is shown


Let us assume all training data are in the form of sentences with labels and of
given maximum length $s$. The first transformation is to convert the sentences into a
vector representation. One way is to perform a lookup for each word in the sentence
to obtain its fixed-dimensional representation, such as a word embedding. Let us
assume that the lookup for the word representation in a fixed vocabulary of size $V$
yields a vector of fixed dimension $d$; thus, each vector can be mapped to
$\mathbb{R}^{d}$. The rows of the matrix represent the words of the sentence, and the
columns correspond to the components of the fixed-length vector representation. The
sentence is thus a real matrix $X \in \mathbb{R}^{s \times d}$.

The general hypothesis, especially in classification tasks, is that the words which
are local in the sequence, similar to n-grams, form complex higher-level features
when combined. This combination of local words in proximity is analogous to computer
vision, where local pixels can be combined to form features such as lines, edges, and
real objects. In computer vision with image representations, the convolution layer
had filters smaller in size than the inputs performing convolution operations via
sliding across an image in patches. In text mining, the first layer of convolution
generally has filters of the same dimension $d$ as the input but has varying height
$h$, normally referred to as the filter size.

Figure 6.8 illustrates this on the sentence “The cat sat on the mat”, which is
tokenized into $s = 6$ words {The, cat, sat, on, the, mat}. A lookup operation obtains
a 3-dimensional word embedding ($d = 3$) for each word. A single convolution filter
with height or size $h = 2$ starts producing the feature map. The output goes through
a non-linear activation of ReLU with 0.0 threshold that then feeds into a 1-max
pooling. Figure 6.8 illustrates the end state obtained by using the same shared kernel
across all the inputs, producing the single output at the end of the 1-max pooling.

Figure 6.9 gives a generalized methodology from sentences to output in a simple
CNN framework. In the real world, many representations of the input words can
exist, similar to images having color channels in the computer vision field. Different
word vectors can map to the channels. These may be static, i.e., pre-trained using a
well-known corpus and left unchanged. They can also be dynamic, where even if
they were pre-trained, backpropagation can fine-tune them. There are various
applications that use not only word embeddings for the representation but also POS
tags for the words or the position in the sequence. The application and the NLP task
at hand determine the particular representations.

Generally, each filter covers a region of the sentence, and there are multiple region
sizes, typically from 2 to 10, i.e., sliding over 2-10 words. For each region size,
there can be multiple learned filters, given by $n_f$. Similar to the image
representation calculations derived above, there are $\frac{s - h}{\text{stride}} + 1$
regions, where stride is the number of words the filter slides across at each step.
Thus, the output of the convolution layer for each filter is a vector in
$\mathbb{R}^{s-h+1}$ for a stride of 1. Formally,

$$o_i = W \cdot X[i : i+h-1, :] \tag{6.33}$$


This equation represents how a filter matrix $W$ of height $h$ slides over the region
matrix given by $X[i : i+h-1, :]$ with unit stride. For each filter, the weights are
shared as in image-based frameworks, giving the local feature extraction. The output,
together with a bias $b$, then flows into a non-linear activation function $f(\cdot)$,
most commonly ReLU. This generates a feature map entry $c_i$ for each region, as in:

$$c_i = f(o_i + b) \tag{6.34}$$


The output vector of each of the feature maps goes through a pooling layer, as
discussed before. The pooling layer provides the downsampling and invariance
property, as mentioned above. The pooling layer also helps in addressing the variable
length of words in sentences by yielding a reduced dimension $\hat{c}$ for the entire
vector. Since vectors represent words or sequences over time, max pooling in the text
domain is referred to as max pooling over time. How two different sizes of texts
get reduced to same-dimensional representation through max pooling over time is
shown in Fig. 6.10.
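
The steps from sentence matrix to pooled feature can be sketched as follows. This is a minimal illustration with random embeddings: the function name `text_conv_1max` is hypothetical, and adding the bias before the activation is an assumption that follows common practice.

```python
import numpy as np


def text_conv_1max(X: np.ndarray, W: np.ndarray, b: float):
    """One filter of height h over an s x d sentence matrix, ReLU, 1-max pooling."""
    s, _ = X.shape
    h = W.shape[0]
    # o_i = W . X[i : i+h-1, :] for each of the s - h + 1 regions (unit stride)
    o = np.array([np.sum(W * X[i:i + h, :]) for i in range(s - h + 1)])
    c = np.maximum(0.0, o + b)       # feature map (bias added before ReLU)
    return c, c.max()                # max pooling over time


rng = np.random.default_rng(0)
d, h = 3, 2
W = rng.normal(size=(h, d))          # shared filter weights

long_sentence = rng.normal(size=(6, d))    # e.g., "The cat sat on the mat"
short_sentence = rng.normal(size=(4, d))   # a shorter sentence, no padding

c_long, pooled_long = text_conv_1max(long_sentence, W, b=0.1)
c_short, pooled_short = text_conv_1max(short_sentence, W, b=0.1)
print(c_long.shape, c_short.shape)   # (5,) (3,)
```

The feature maps differ in length, but each filter contributes exactly one scalar after max pooling over time, which is what allows sentences of different lengths to share the same output layer.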


In short-text classification, 1-max pooling has been effective. In document 
classification, k-max pooling has been a better choice. 


The output of the pooling layers is concatenated and passed on to the softmax 
layer, which performs classification based on the number of categories or labels. 
There are many hyperparameter choices when representing text and performing 
CNN operations on text, such as type of word representation, choice of how many 
word representations, strides, paddings, filter width, number of filters, activation 
function, pooling size, pooling type, and number of CNN blocks before the final 
softmax layer, to name a few. 

Fig. 6.9: A simple text and CNN mapping. Sentence to word representations, mapping
to the embeddings with different channels, act as an input. Different filters for
each height, i.e., 2 filters each of sizes 2 and 3, shown in two shades, capture
different features that are then passed to non-linear processing to generate outputs.
These go through a max pooling operation, which selects the maximum values from each
filter. The values are concatenated to form a final output vector

The total number of parameters for one simple block of CNN with output
for text processing can be given by:

$$\text{parameters} = \underbrace{(V+1) \times d}_{\text{word embeddings (static)}} + \underbrace{((h \times d) \times n_f) + n_f}_{\text{filters}} + \underbrace{n_f + 1}_{\text{softmax output}} \tag{6.35}$$

where $n_f$ is the number of filters.
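
Eq. (6.35) can be evaluated directly. In the sketch below the function name `text_cnn_parameters` is hypothetical and the example sizes are arbitrary illustrative values, not figures from the text.

```python
def text_cnn_parameters(vocab_size: int, embed_dim: int,
                        filter_height: int, num_filters: int) -> int:
    """Parameter count of one simple text CNN block, Eq. (6.35)."""
    embeddings = (vocab_size + 1) * embed_dim
    # filter weights plus one bias per filter
    filters = (filter_height * embed_dim) * num_filters + num_filters
    softmax_output = num_filters + 1
    return embeddings + filters + softmax_output


# Illustrative values: V = 10,000, d = 300, h = 3, 100 filters
print(text_cnn_parameters(10_000, 300, 3, 100))   # 3090501
```

With these values the embedding table accounts for about 97% of the total, which is one reason pre-trained (static) embeddings are attractive.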


Now, we will provide a formal treatment from input to output for a simple CNN
in the text domain with word representations. For a variable-length sentence from
the training data, with a maximum length of $s$ words and a lookup yielding word
embeddings of vector size $d$, we obtain an input matrix in $\mathbb{R}^{s \times d}$.
For all other sentences which have fewer than $s$ words, we can use padding with 0 or
random values in the inputs. As in the previous section, there can be many
representations of these words, such as static or dynamic, different embeddings for
the words such as word2vec, GloVe, etc., and even different embeddings such as
positional, tag-based (POS-tag), etc., all forming the input channels with the only
constraint that they are all of the same dimension.

The convolution filter can be seen as weights in 1d of the same length as the word
embedding but sliding over windows of $h$ words. This creates the word windows
$w_{1:h}, w_{2:h+1}, \ldots, w_{s-h+1:s}$. The resulting features are represented by
$[o_1, o_2, \ldots, o_{s-h+1}]$ in $\mathbb{R}^{s-h+1}$, which go through a non-linear
activation, such as ReLU. The output of the non-linear activation has the same
dimension as its input and can be represented as $[c_1, c_2, \ldots, c_{s-h+1}]$ in
$\mathbb{R}^{s-h+1}$. Finally, the outputs of the non-linear activation function are
further passed to a max pooling layer which finds a single scalar value $\hat{c}$ in
$\mathbb{R}$ for each filter. In practice, there are multiple filters for each size
$h$, and there are multiple-sized filters for $h$ varying from 2 to 10 in the general
case. The output layer connects all the max pooling layer outputs $\hat{c}$ into a
single vector and uses the softmax function for classification. The considerations of
padding sentences at the beginning and the end, stride lengths, narrow or wide
convolutions, and other hyperparameters are as in the general convolution mapping
process.

Fig. 6.10: Two simple sentences with padding and 3-dimensional word embeddings,
going through the same single filter of size or height 2, a ReLU with a threshold of 0,
and max pooling over time, result in the same output value corresponding to the
2-gram of “The cat”


6.4.2 Character-Based Representation and CNN 


In many classification-related tasks, the vocabulary size grows large, and even with
embeddings, words that were not seen in training lead to suboptimal performance.
Work by Zhang et al. [ZZL15] uses character-level embeddings instead of word-level
embeddings in the training input to overcome such issues. The researchers show that
character embeddings result in an open vocabulary and provide a way to handle
misspelled words, to name a few benefits. Figure 6.11 shows the designed CNN, which
has many convolution blocks of 1d convolution layers, non-linear activations, and
k-max pooling, all stacked together to form a deep convolutional network. The
representation uses the well-known set of 70 characters comprising all lowercase
letters, digits, and other characters, using one-hot encoding. A chunk of characters
of fixed size $l = 1024$ at a given time forms the input for variable-length text.
Two networks of 6 convolutional blocks each are used, the larger with 1024 dimensions
and the smaller with 256 dimensions. Two layers have filter height of 7, and the rest
have height of 3, with 3-pooling in the first few and strides of 1. The last 3 layers
are fully connected. The final layer is a softmax layer for classification.


[Figure labels: sentence with 1024 characters; characters as 1-hot encoding; multiple
CNN blocks, 1D CNN (1024/256, K=3); three fully connected layers (2048); softmax layer]

Fig. 6.11: Character-based CNN for text classification


6.5 Classic CNN Architectures 


In this section, we will visit some of the standard CNN architectures. We will discuss 
their structure in detail and will give some historical context for each. Though many 
of them have been popular in the computer vision domain, they are still applicable 
with variations in the text and speech domains, as well. 
6.5.1 LeNet-5


LeCun et al. presented LeNet-5 as one of the first implementations of CNNs and
showed impressive results on the handwritten digit recognition problem [LeC+98].
Figure 6.12 shows the complete design of LeNet-5. LeCun demonstrated the concept
of decreasing the height and width via convolutions, increasing the filter/channel
size, and having fully connected layers with a cost function to propagate the errors,
which are now the backbone of all CNN frameworks. LeNet-5 used the MNIST
dataset of 60K training examples for training and learning the weights. In all of our
discussions from now on, the representation of a layer will be given by
$n_w \times n_h \times n_c$, where $n_w, n_h, n_c$ are the width, height, and the
number of channels/filters. Next, we will give the details of the design regarding
inputs, outputs, number of filters, and pooling operations.


• The input layer uses only the grayscale pixel values and is 32 × 32 × 1 in size. It
  is normalized to mean 0 and variance 1.

• A filter of 5 × 5 × 6 with no padding and stride s = 1 is used to create a layer with
  size 28 × 28 × 6.

• This is followed by an average pooling layer with filter width f = 2 and stride
  s = 2, resulting in a layer of size 14 × 14 × 6.

• Another convolution layer of 5 × 5 × 16 is applied with no padding and stride
  s = 1 to create a layer of size 10 × 10 × 16.

• Another average pooling layer with filter width f = 2 and stride s = 2, resulting
  in reduced height and width, yields a layer of size 5 × 5 × 16.

• This is then connected to a fully connected layer of size 120 and followed by
  another fully connected layer of size 84.

• These 84 features are then fed to an output function which uses the Euclidean
  radial basis function for determining which of the 10 digits is represented by
  these features.


1. LeNet-5 used tanh function for non-linearity instead of ReLU, which is 
more popular in today’s CNN frameworks. 

2. Sigmoid non-linearity was applied after the pooling layer. 

3. LeNet-5 used the Euclidean radial basis function instead of softmax, 
which is more popular today. 

4. The number of parameters/weights in LeNet-5 was approximately 60K
with approximately 341K multiplications and accumulations (MACs).

5. The concept of no padding, which results in lowering the size, was used 
back then, but it is not very popular these days. 
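
The figure of roughly 60K parameters can be approximated from the layer sizes listed above. The sketch below makes simplifying assumptions that differ from the original network: every filter in the second convolution sees all 6 input channels, the pooling layers carry no parameters, and the output layer is treated as a dense layer with biases rather than a radial basis function. The helper names are hypothetical.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus one bias per filter for a k x k convolution."""
    return (k * k * c_in + 1) * c_out


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit for a fully connected layer."""
    return (n_in + 1) * n_out


lenet5 = {
    "conv1 (5 x 5 x 6)": conv_params(5, 1, 6),
    "conv2 (5 x 5 x 16)": conv_params(5, 6, 16),
    "fc1 (400 -> 120)": dense_params(5 * 5 * 16, 120),
    "fc2 (120 -> 84)": dense_params(120, 84),
    "output (84 -> 10)": dense_params(84, 10),
}
total = sum(lenet5.values())
print(total)   # 61706
```

Under these assumptions the two fully connected layers hold the large majority of the weights, a pattern that recurs in the larger architectures discussed next.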
Fig. 6.12: LeNet-5


6.5.2 AlexNet 


AlexNet, designed by Krizhevsky et al. [KSH12a], was the first deep learning archi- 
tecture that won the ImageNet challenge in 2012 by a large margin (around 11.3%). 
AlexNet was responsible in many ways for focusing attention on deep learning 
research [KSH12a]. Its design is very similar to LeNet-5, but with more layers 
and filters resulting in a larger network and more parameters to learn. The work 
in [KSH12a] showed that with deep learning frameworks, features can be learned 
instead of manually generated with deep domain understanding. The details of the 
AlexNet designs are listed below (Fig. 6.13): 


• Unlike LeNet-5, AlexNet used all the three channels of inputs. The input of size
  227 × 227 × 3 is convolved with an 11 × 11 × 96 filter, stride of s = 4, and with
  non-linearity performed by ReLU, giving an output of 55 × 55 × 96.

• This goes through a max pooling layer with size 3 × 3 and stride s = 2 and reduces
  the output to 27 × 27 × 96.

• This layer goes through a local response normalization (LRN) which effectively
  normalizes the values across the depth of the channels and then another
  convolution of size 5 × 5 × 256, stride s = 1, padding p = 2, with ReLU applied to
  get an output of 27 × 27 × 256.

• This goes through a max pooling layer with size 3 × 3 and stride s = 2, reducing
  the output to 13 × 13 × 256.

• This is followed by LRN, another convolution of size 3 × 3 × 384, stride s = 1,
  padding p = 1, with ReLU applied to get an output of 13 × 13 × 384.

• This is followed by another convolution of size 3 × 3 × 384, stride s = 1, padding
  p = 1, with ReLU applied to get an output of 13 × 13 × 384.

• This is followed by convolution of size 3 × 3 × 256, stride s = 1, padding p = 1,
  with ReLU applied to get an output of 13 × 13 × 256.

• This goes through a max pooling layer with size 3 × 3 and stride s = 2, reducing
  the output to 6 × 6 × 256.

• This output 6 × 6 × 256 = 9216 is passed to a fully connected layer of size 9216,
  followed by a dropout of 0.5 applied to two fully connected layers with ReLU of
  size 4096.

• The output layer is a softmax layer with 1000 classes or categories of images to
  learn.
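
The layer sizes in the list above follow from applying the output-size relation of Eq. (6.9) layer by layer. The sketch below traces them; the helper name `out_size` and the tuple layout of `layers` are hypothetical conveniences.

```python
def out_size(w: int, f: int, s: int = 1, p: int = 0) -> int:
    """Spatial output size for input w, filter f, stride s, padding p (Eq. 6.9)."""
    return (w - f + 2 * p) // s + 1


# (name, filter size, stride, padding, output channels or None for pooling)
layers = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool3", 3, 2, 0, None),
]

size, channels = 227, 3
shapes = []
for name, f, s, p, c_out in layers:
    size = out_size(size, f, s, p)
    channels = channels if c_out is None else c_out   # pooling keeps the depth
    shapes.append((name, size, size, channels))
    print(f"{name}: {size} x {size} x {channels}")

flattened = size * size * channels
print(flattened)   # 9216
```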
Fig. 6.13: AlexNet


1. AlexNet used ReLU and showed it to be a very effective non-linear
activation function. On a CIFAR dataset, a network with ReLU trained about
six times faster than an equivalent network with tanh, which was the reason
why the researchers chose ReLU.

2. The number of parameters/weights in AlexNet was approximately 63.2 million,
with approximately 1.1 billion computations, which is significantly higher
than LeNet-5.

3. Speedup was obtained via two GPUs. The layers were split so that each
GPU worked in parallel, with its output to the next layer collocated in its
node, and sent information to the other GPU. Even with GPUs, the 90 training
epochs required 5-6 days of training on the 1.2 million training examples.


6.5.3 VGG-16 


VGG-16, also known as VGGNet, by Simonyan and Zisserman, is known for its
uniform design and has been very successful in many domains [SZ14]. The uniformity
of having all the convolutions be 3 × 3 with stride s = 1 and max pooling with 2 × 2,
along with the channels increasing from 64 to 512 in multiples of 2, makes it very
appealing and easy to set up. It has been shown that stacking three layers of 3 × 3
convolutions with stride s = 1 covers the same receptive field as a single 7 × 7
convolution, with a significantly reduced number of computations. VGG-16 has two
fully connected layers at the end with a softmax layer for classification. VGG-16’s
only disadvantage is the huge network with approximately 140 million parameters. Even
with a GPU setup, a long time would be required for training the model (Fig. 6.14).
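
The comparison between three stacked 3 × 3 convolutions and one 7 × 7 convolution can be made concrete. The sketch below assumes stride 1, the same number of channels C at the input and output of every layer, and ignores biases; the function names are hypothetical.

```python
def stacked_receptive_field(k: int, num_layers: int) -> int:
    """Receptive field of num_layers stacked k x k convolutions with stride 1."""
    return num_layers * (k - 1) + 1


def conv_weights(k: int, channels: int, num_layers: int = 1) -> int:
    """Weights of num_layers k x k convolutions with `channels` in and out."""
    return num_layers * k * k * channels * channels


C = 64
print(stacked_receptive_field(3, 3))   # 7
print(conv_weights(3, C, 3))           # 110592  (27 * C^2)
print(conv_weights(7, C))              # 200704  (49 * C^2)
```

Under these assumptions the stacked version needs 27/49, or about 55%, of the weights of the single large filter, and it applies three non-linearities instead of one.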
Fig. 6.14: VGG-16 CNN


6.6 Modern CNN Architectures 


We will discuss changes that have led to modern CNN architectures in different 
domains, including in text mining. 
6.6.1 Stacked or Hierarchical CNN


The basic CNN mapping to sentences with a convolution filter of size $k$ is
analogous to the n-gram token detector in classic NLP settings. The idea of
a stacked or hierarchical CNN is to extend the principle by adding more layers.
In doing so, the receptive field size is increased, and larger windows of words or
contexts will be captured as features, as shown in Fig. 6.15.

If we consider $(W, b)$ as the parameters corresponding to weights and biases,
$\oplus$ as the concatenation operation done on a sentence of length $n$ with the
sequence of word embeddings $e_{1:n}$, each of dimension $d$, and $k$ as the window
size of the convolution, the output $c_{1:m}$ is given by:

$$c_{1:m} = \mathrm{CONV}^{k}_{(W,b)}(e_{1:n}) \tag{6.36}$$

$$c_i = f\left(\oplus(e_{i:i+k-1}) \cdot W + b\right) \tag{6.37}$$

where $m = n - k + 1$ for narrow convolutions, and $m = n + k - 1$ for wide
convolutions, which pad the sentence with $k - 1$ positions on each side.

If we consider $p$ layers, where the convolution of one feeds into another, then we
can write

$$c^{1}_{1:m_1} = \mathrm{CONV}^{k_1}_{(W^{1},b^{1})}(e_{1:n}), \quad c^{2}_{1:m_2} = \mathrm{CONV}^{k_2}_{(W^{2},b^{2})}(c^{1}_{1:m_1}), \quad \ldots, \quad c^{p}_{1:m_p} = \mathrm{CONV}^{k_p}_{(W^{p},b^{p})}(c^{p-1}_{1:m_{p-1}}) \tag{6.38}$$

Fig. 6.15: A hierarchical CNN

As the layers feed into each other, the effective window size or the receptive field
to capture the signal increases. For instance, if in sentiment classification there is
the sentence “The movie is not a very good one”, a convolution filter with size 2 will
not capture the sequence “not a very good one”, but a stacked CNN with the same size
2 will capture this in the higher layers.


6.6.2 Dilated CNN 


In the stacked CNN, we assumed strides of size 1, but if we generalize the stride to
size $s$, we can then write the convolution operation as:

$$c_{1:m} = \mathrm{CONV}^{k,s}_{(W,b)}(e_{1:n}) \tag{6.39}$$

$$c_i = f\left(\oplus(e_{1+(i-1)s \,:\, (i-1)s+k}) \cdot W + b\right) \tag{6.40}$$

A dilated CNN can be seen as a special version of the stacked CNN. One way is
to have the stride size of each layer be $k - 1$ when the kernel size is $k$.

$$c_{1:m} = \mathrm{CONV}^{k,k-1}_{(W,b)}(e_{1:n}) \tag{6.41}$$

Convolutions of size $k \times k$ on an $l$-layered CNN without pooling result in
receptive fields of size $l \times (k - 1) + 1$, which is linear in the number of
layers $l$. A dilated CNN helps increase the receptive field exponentially with
respect to the number of layers. Another approach is to keep the stride size constant,
as in $s = 1$, but perform length shortening at each layer using local pooling by
using maximum or average as values. Figure 6.16 shows how by progressively increasing
the dilations in the layers, the receptive fields can be exponentially increased to
cover a larger field in every layer [YK15].


1. A dilated CNN helps in capturing the structure of sentences over longer 
spans of text and is effective in capturing the context. 

2. A dilated CNN can have fewer parameters and so increase the speed of 
training while capturing more context. 
Fig. 6.16: A dilated CNN showing the increase in the receptive fields with dilations
changing to 1, 2, and 4
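
The growth of the receptive field with and without dilation can be computed with a short helper. This sketch assumes stride 1 and no pooling, with each layer adding (k − 1) × dilation positions to the receptive field; the function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field (along one dimension) of stacked stride-1 convolutions."""
    rf = 1
    for dilation in dilations:
        rf += (kernel_size - 1) * dilation
    return rf


# Three ordinary layers with kernel size 3: linear growth
print(receptive_field(3, [1, 1, 1]))   # 7
# Dilations 1, 2, 4 as in Fig. 6.16: exponential growth
print(receptive_field(3, [1, 2, 4]))   # 15
```

Both networks have the same number of layers and weights, but the dilated one covers roughly twice the context after three layers, and the gap widens with depth.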


6.6.3 Inception Networks 


Inception networks by Szegedy et al. [Sze+17] are currently among the
best-performing CNNs, especially in computer vision. The core of an inception network
is a repetitive inception block. An inception block uses many filters, such as 1 × 1,
3 × 3, and 5 × 5, as well as max pooling without the need to choose any one of
them. The central idea behind using 1 × 1 filters is to reduce the volume and hence
the computation before feeding it to a larger filter such as 3 × 3.

Figure 6.17 shows a sample 28 × 28 × 192 output from a previous layer convolved
with a 3 × 3 × 128 filter to give an output of 28 × 28 × 128; i.e., 128 filters at the
output result in about 174 million MACs. By having 1 × 1 × 96 filters to reduce the
volume and then convolving with 3 × 3 × 128 filters, the total computations reduce
to approximately 100 million MACs, a saving of about 40%. Similarly, by using
1 × 1 × 16 filters preceding 5 × 5 × 32 filters, the total computations can be reduced
from approximately 120 to 12 million, a reduction by a factor of ten. The 1 × 1 filter
that reduces the volume is also called a bottleneck layer.
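
The MAC counts quoted above can be reproduced with a short calculation. The sketch assumes stride 1 and padding that keeps the 28 × 28 spatial size, and counts one multiply-accumulate per weight per output position; the function name `conv_macs` is hypothetical.

```python
def conv_macs(height: int, width: int, c_in: int, k: int, c_out: int) -> int:
    """Multiply-accumulate operations of a k x k convolution with same-size output."""
    return height * width * c_out * k * k * c_in


direct_3x3 = conv_macs(28, 28, 192, 3, 128)
bottleneck_3x3 = conv_macs(28, 28, 192, 1, 96) + conv_macs(28, 28, 96, 3, 128)

direct_5x5 = conv_macs(28, 28, 192, 5, 32)
bottleneck_5x5 = conv_macs(28, 28, 192, 1, 16) + conv_macs(28, 28, 16, 5, 32)

print(direct_3x3, bottleneck_3x3)   # 173408256 101154816
print(direct_5x5, bottleneck_5x5)   # 120422400 12443648
```

The 3 × 3 branch therefore drops to about 58% of its original cost, and the 5 × 5 branch to roughly one tenth.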

Figure 6.18 gives a pictorial view of a single inception block with a sample
2d input with width and height of 28 × 28 and depth of 192, producing an output
of width and height of 28 × 28 and depth of 256. The 1 × 1 filter plays a dual
role in reducing the volume for other filters, as well as producing the output. The
1 × 1 × 64 filter is used to produce an output of 28 × 28 × 64. The input, when
passed through the 1 × 1 × 96 filter and convolved with 3 × 3 × 128, generates a
28 × 28 × 128 output. The input, when passed through the 1 × 1 × 16 filter and
then through the 5 × 5 × 32 filter, produces a 28 × 28 × 32 output. Max pooling
with stride 1 and padding is used to produce an output of size 28 × 28 × 192. Since
the max pooling output has many channels, it passes through another
1 × 1 × 32 filter for reducing the volume to 28 × 28 × 32. Thus, each filtering
operation can occur in parallel and generate an output, and the four outputs
(64 + 128 + 32 + 32 channels) are concatenated together for a final size of
28 × 28 × 256.

Fig. 6.17: Computational cost savings with having a 1 × 1 filter precede a 3 × 3 filter

Fig. 6.18: An inception block with multiple filters of size 1 × 1, 3 × 3, 5 × 5,
producing a concatenated output


A 1 x 1 convolution block plays an important role in reducing the volume 
for a larger filter size convolution. An inception block allows multiple filter 
weights to be learned, thus removing the need to select one of the filters. 
Parallel operations of convolutions across different filters and concatenation 
give further improved performance. 
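The inception block described above can be sketched in plain NumPy to make the shape bookkeeping concrete. This is an illustrative sketch, not the reference implementation: the weights are random, the ReLU activation and the 3 x 3 pooling window are assumptions (the text only specifies stride 1 with padding), and the helper names are hypothetical.

```python
import numpy as np


def conv_same(x, w):
    """Stride-1 convolution with zero padding that keeps height and width.

    x: (H, W, C_in) input volume, w: (k, k, C_in, C_out) filter bank.
    """
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    h, wd = x.shape[:2]
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # each kernel offset contributes one shifted matrix product
            out += xp[i:i + h, j:j + wd, :] @ w[i, j]
    return out


def max_pool_same(x, k=3):
    """Stride-1 max pooling with padding (k = 3 is an assumed window size)."""
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    h, wd = x.shape[:2]
    windows = [xp[i:i + h, j:j + wd, :] for i in range(k) for j in range(k)]
    return np.max(windows, axis=0)


def relu(t):
    return np.maximum(t, 0.0)


def inception_block(x, rng):
    def w(k, c_in, c_out):
        # random weights stand in for learned filters
        return rng.standard_normal((k, k, c_in, c_out)) * 0.01

    c = x.shape[2]
    b1 = relu(conv_same(x, w(1, c, 64)))                                  # 1 x 1 branch
    b3 = relu(conv_same(relu(conv_same(x, w(1, c, 96))), w(3, 96, 128)))  # 1 x 1 then 3 x 3
    b5 = relu(conv_same(relu(conv_same(x, w(1, c, 16))), w(5, 16, 32)))   # 1 x 1 then 5 x 5
    bp = relu(conv_same(max_pool_same(x), w(1, c, 32)))                   # pooling then 1 x 1
    return np.concatenate([b1, b3, b5, bp], axis=2)


rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))
y = inception_block(x, rng)
print(y.shape)
```

The four branches never depend on each other, which is why they can run in parallel before the channel-wise concatenation.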


6.6.4 Other CNN Structures 


In many NLP tasks, such as sentiment analysis, features using syntactic elements
and other structural information in the language have yielded improved performance.
Figure 6.19 shows a syntactic and semantic representation connected through
dependency arcs, which map to a tree-based or a graph-based representation.

A basic CNN does not capture such dependencies in the language. Hence, structured
CNNs have been proposed to overcome this shortcoming. Dependency CNN
(DCNN), as shown in Fig. 6.20, is one way of representing word dependencies in
a sentence via word embeddings, then performing tree convolutions very similar
to matrix convolutions, and finally employing a fully connected layer before the
output classification [Ma+15]. In Mou et al. [Mou+14], the application domain is
programming languages instead of natural languages, and a similar tree-based CNN
is used to learn the feature representation.

The syntactic and semantic dependencies can be captured with a graph representation
G = (V, E), where words act as nodes or vertices V, and relationships between them
are modeled as edges E. Convolution operations can then be performed on these graph
structures [Li+15].

As seen in Chap. 5, each embedding framework, such as word2vec or GloVe,
captures different distributional semantics, and each can be of a different dimension.
Zhang et al. proposed a multi-group norm constraint CNN (MGNC-CNN) that
can combine different embeddings, each of a different dimension, on the same
sentence [ZRW16]. The regularization can be applied either to each group (MGNC-CNN)
or at the concatenation layer (MG-CNN), as shown in Fig. 6.21.

Fig. 6.19: A sentence with syntactic and semantic dependencies is shown


In many applications, such as machine translation, text entailment, question
answering, and others, there is often a need to compare two inputs for similarity. Most
of these frameworks have some form of Siamese structure with two parallel networks,
one for each sentence, combining convolution layers, nonlinear transformations,
pooling, and stacking, until the features are combined at the end. The work of Bromley
et al. on signature comparison has been the general inspiration behind many such
frameworks [Bro+94].

Fig. 6.20: A tree-based structured CNN that can capture syntactic and semantic
dependencies

Fig. 6.21: MG-CNN and MGNC-CNN showing different embeddings of different
dimensions used for classification. MG-CNN will have norm constraints applied at
layer o while MGNC-CNN will have norm constraints applied at layers o1 and o2,
respectively


We will illustrate one such recent framework by Yin et al. [Yin+16a], who
refer to it as a basic bi-CNN, as shown in Fig. 6.22. The framework is further
modified to have a shared attention layer and performs very well on diverse applications,
such as answer selection, paraphrase identification, and text entailment. Each of the
twin networks, consisting of CNN and pooling blocks, processes one of the two
sentences, and the final layer solves the sentence pair task. The input layer has word
embeddings from word2vec concatenated for the words in each sentence. Each block
uses wide convolutions so that all words in the sentence can provide signals, as
compared to a narrow convolution. The tanh activation function tanh(W x_i + b) is used
as the nonlinear transformation. Next, an average pooling operation is performed on
each. Finally, the outputs of both average pooling layers are concatenated and passed
to a logistic regression function for binary classification.

Fig. 6.22: A basic bi-CNN by [Yin+16a] for sentence pair tasks using wide
convolutions, average pooling, and logistic regression for binary classification
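The forward pass of such a basic bi-CNN can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: random vectors stand in for word2vec embeddings, both branches share one filter bank (a common Siamese choice that the description above does not spell out), and all names and sizes are hypothetical.

```python
import numpy as np


def wide_conv_tanh(emb, W, b):
    """Wide 1d convolution over a sentence followed by tanh(W x_i + b).

    emb: (n_words, d) embeddings, W: (k * d, n_filters), b: (n_filters,).
    Padding k - 1 zero vectors on both sides yields n_words + k - 1 outputs,
    so every word, including those at the edges, is seen by every filter slot.
    """
    n, d = emb.shape
    k = W.shape[0] // d
    pad = np.zeros((k - 1, d))
    padded = np.vstack([pad, emb, pad])
    windows = np.stack([padded[i:i + k].ravel() for i in range(n + k - 1)])
    return np.tanh(windows @ W + b)


def bi_cnn_probability(s1, s2, W, b, v, c):
    """Average-pool each branch, concatenate, and apply logistic regression."""
    r1 = wide_conv_tanh(s1, W, b).mean(axis=0)
    r2 = wide_conv_tanh(s2, W, b).mean(axis=0)
    z = np.concatenate([r1, r2]) @ v + c
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
d, k, n_filters = 50, 3, 8                      # illustrative sizes
W = rng.standard_normal((k * d, n_filters)) * 0.1
b = np.zeros(n_filters)
v = rng.standard_normal(2 * n_filters) * 0.1    # logistic regression weights
c = 0.0
s1 = rng.standard_normal((7, d))                # sentence with 7 words
s2 = rng.standard_normal((5, d))                # sentence with 5 words
p = bi_cnn_probability(s1, s2, W, b, v, c)
print(wide_conv_tanh(s1, W, b).shape, float(p))
```

Average pooling makes the two sentence representations the same size regardless of sentence length, which is what allows them to be concatenated for the final classifier.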


6.7 Applications of CNN in NLP 


In this section, we will discuss some of the applications of CNNs in various text
mining tasks. Our goal is to summarize this research and provide insights into popular
CNNs and modern designs. CNNs with various modifications, and even combinations
with other frameworks such as LSTMs, have been used in different NLP tasks.
Since a CNN by itself can capture local features and, through stacked layers,
combinations of these features, CNNs have been primarily used in text/document
classification, text categorization, and sentiment classification tasks. CNNs
lose the order of the sequences, though they have also been used in sequence-based
tasks, such as POS tagging, NER, chunking, and more. To be truly effective in such
settings, they either need to be combined with other frameworks or have positional
features encoded.


6.7.1 Text Classification and Categorization 


Many text classification tasks that employ n-grams of words to capture local
interactions and features have seen considerable success using CNN frameworks in
the last few years. CNN-based frameworks can easily capture temporal and hierarchical
features in variable-length text sequences. Word or character embeddings are generally
the first layers in these frameworks. Based on the data and type, either pre-trained
or static embeddings are used to obtain a vector representation of the words in
sentences.

Collobert et al. [CW08c], [Col+11] use a one-layer convolution block for modeling
sentences to perform many NLP tasks. Yu et al. [Yu+14] also use a one-layer
CNN to model a classifier to select question-answer mappings. Kalchbrenner et al.
extend the idea to form a dynamic CNN by stacking CNNs and using dynamic k-max
pooling operations over long sentences [KGB14b]. This research significantly
improved over existing approaches at the time in many short-text and multiclass
classification tasks. Kim extends the single-block CNN by adding multiple channels in
the input and multiple kernels of various lengths to give higher-order combinations
of n-grams [Kim14b]. This work also performed various impact analyses of static vs.
dynamic channels, the importance of max pooling, and more to detect which basic
blocks yielded lower error rates. Yin et al. extend Kim's multi-channel, variable
kernel framework to use hierarchical CNNs, obtaining further improvements [YS16].
Santos and Gatti use character-to-sentence level representations with CNN frameworks
for effective sentiment classification [SG14]. Johnson and Zhang explore the
usage of region embeddings for effective short-text categorization, due to their ability
to capture contexts over a larger span where word embeddings fail [JZ15]. Wang and
others perform semantic clustering using density peaks on pre-trained word embeddings,
forming a representation they call semantic cliques, with such semantic units
used further with convolutions for short-text mining [Wan+15a].
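Since k-max pooling appears in several of the models above, a small NumPy sketch may help. It keeps the k largest activations of each feature map while preserving their order in the sequence. The rule for choosing k dynamically follows the formulation in Kalchbrenner et al. as best recalled here and should be treated as an assumption; the function names are hypothetical.

```python
import numpy as np


def k_max_pooling(feature_map, k):
    """Keep the k largest values per feature map, in their original order.

    feature_map: (n_positions, n_maps); returns an array of shape (k, n_maps).
    """
    idx = np.argsort(feature_map, axis=0)[-k:]   # positions of the k largest values
    idx = np.sort(idx, axis=0)                   # restore sequence order
    return np.take_along_axis(feature_map, idx, axis=0)


def dynamic_k(layer, total_layers, sentence_len, k_top):
    """Assumed dynamic rule: k = max(k_top, ceil((L - l) / L * s))."""
    scaled = ((total_layers - layer) * sentence_len + total_layers - 1) // total_layers
    return max(k_top, scaled)


fm = np.array([[1.0, 9.0],
               [5.0, 2.0],
               [3.0, 8.0],
               [7.0, 4.0]])
print(k_max_pooling(fm, 2))       # rows [5, 9] and [7, 8]
print(dynamic_k(1, 3, 18, 3))     # first of three layers, 18-word sentence
```

Unlike plain max pooling, the order of the surviving activations is retained, so later layers can still see which strong feature came first.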

Zhang et al. explore the use of character-level representations for a CNN instead
of word-level embeddings [ZZL15]. On a large dataset, character-level modeling of
sentences performs very well when compared to traditional sparse representations
or even deep learning frameworks, such as word embedding CNNs or RNNs. Conneau
et al. design a very deep CNN along with modifications such as shortcuts to
learn more complex features [Con+16]. Xiao and Cho's research further extends
character-level encoding to the classification of entire documents by combining
the CNN with RNNs, using a lower number of parameters [XC16].


6.7.2 Text Clustering and Topic Mining


In their work, Xu et al. use a CNN for short-text clustering in a completely
unsupervised manner [Xu+17]. The original keyword features from the text are used to
generate compact binary codes with locality-preserving constraints. A deep feature
representation is obtained using word embeddings with the dynamic CNN in [KGB14b].
The outputs of the CNN are made to fit the binary codes during the training process.
The deep features thus evolved from the CNN layers are passed to normal clustering
algorithms, such as k-means, to give the final clustering of the data. Experiments
show that this method does significantly better than traditional feature-based
methods on various datasets.

Lau et al. jointly learn the topics and language models using CNN-based
frameworks that result in more coherent, effective, and interpretable topic
models [LBC17]. Document context is captured using a CNN framework, and the
resulting document vectors are combined with topic vectors to give an effective 
document-topic representation. The language model is composed of the same doc- 
ument vectors from above and LSTMs. 


6.7.3 Syntactic Parsing 


In seminal work by Collobert et al., many NLP tasks, such as POS Tagging, Chunking,
Named Entity Recognition, and Semantic Role Labeling are performed for the
first time using word embeddings and CNN blocks [Col+11]. The research shows
the strength of using CNNs in finding features in an automated way rather than hand- 
crafted task-specific features used for similar tasks. Zheng et al. show that character- 
based representations, CNN-based frameworks, and dynamic programming can be 
very effective in performing syntactic parsing without any task-specific feature engi- 
neering in languages as complex as Chinese [Zhe+15]. Santos and Zadrozny show 
that using character-level embeddings jointly with word-level embeddings and deep 
CNNs can further improve POS Tagging tasks [DSZ14] in English and Portuguese. 
Zhu et al. propose a recursive CNN (RCNN) to capture complex structure in the de- 
pendency trees. An RCNN has a basic k-ary tree as a unit that can capture the parsing 
tree with the relationship between nodes and children. This structure can be applied 
recursively to map the representation for the entire dependency tree. The RCNN is 
shown to be very effective as a re-ranking model in dependency parsers [Zhu+15]. 


6.7.4 Information Extraction 


As discussed in Chap. 3, Information Extraction (IE) is a general category which
has various sub-categories, such as entity extraction, event extraction, relation
extraction, coreference resolution, and entity linking, to name a few. In the work of
Chen et al., instead of using hand-coded features, the researchers employ word-based
representations to capture lexical and sentence-level features that work with
a modified CNN for superior extraction of multiple events in text [Che+15]. The
researchers use word context, positions, and event type in their representations and
embeddings flowing into a CNN with multiple feature maps. Instead of a CNN
with max pooling, which can miss multiple events happening in a sentence, the
researchers employ dynamic multi-pooling. In dynamic multi-pooling, the feature
maps are split into three parts, finding a maximum for each part.
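Dynamic multi-pooling can be sketched in a few lines of NumPy. The sketch assumes the two split positions are given (for instance, the positions of two elements of interest in the sentence); the text above does not specify how they are chosen, so `cut1` and `cut2` are hypothetical parameters.

```python
import numpy as np


def dynamic_multi_pooling(feature_maps, cut1, cut2):
    """Max-pool each of three segments of every feature map separately.

    feature_maps: (n_positions, n_maps). The positions are split into
    [0, cut1], (cut1, cut2], and (cut2, end); each segment must be non-empty.
    Returns a vector of length 3 * n_maps.
    """
    parts = (feature_maps[:cut1 + 1],
             feature_maps[cut1 + 1:cut2 + 1],
             feature_maps[cut2 + 1:])
    return np.concatenate([part.max(axis=0) for part in parts])


fm = np.array([[1, 0],
               [3, 2],
               [0, 5],
               [4, 1],
               [2, 7],
               [6, 0]])
pooled = dynamic_multi_pooling(fm, cut1=1, cut2=3)
print(pooled)            # three maxima per feature map
print(fm.max(axis=0))    # plain max pooling keeps only one value per map
```

Keeping one maximum per segment preserves evidence from different regions of the sentence, which plain max pooling would collapse into a single value.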

As discussed above in the section on NLP, Zheng et al. [Zhe+15] and Santos et
al. [DSZ14] employ CNNs for relation classification without any hand-coded
features. In the work of Vu et al., relation classification uses a combination of CNNs
and RNNs. In a sentence which has entities and relations between them, the re- 
searchers perform a split between left and middle, and middle and right part of the 
sentence, flowing into two different word embeddings and CNN layers with max 
pooling. This design gives special attention to the middle part, which is an impor- 
tant aspect in relation classification as compared to previous research. Bi-directional 
RNNs with an additional hidden layer are introduced to capture relation arguments 
from succeeding words. The combined approach shows significant improvements 
over traditional feature-based and even independently used CNNs and RNNs. 

Nguyen and Grishman use a CNN-based framework for relation extraction
[NG15b]. They use word embeddings and position embeddings concatenated as
the input representation of sentences with entities having relations. They employ a 
CNN with multiple filter sizes and max pooling. It is interesting to see that the per- 
formance of their framework is better than all handcrafted feature engineering-based 
machine learning systems that use many morphological and lexical features. 

In the work of Adel et al., the researchers compare many techniques, from
traditional feature-based machine learning to CNN-based deep learning, for relation
classification in the context of slot filling [AS17a]. Similar to the above work, they
break the sentences into three parts for capturing the contexts and use a CNN with 
k-max pooling. 


6.7.5 Machine Translation 


Hu et al. highlight how CNNs can be used to encode both semantic similarities and
contexts in translation pairs and thus yield more effective translations [Hu+15].
Another interesting aspect of this research is its use of curriculum training,
where the training data is ordered from easy to difficult, and its use of
phrase-to-sentence context encoding for effective translations.

Meng et al. build a CNN-based framework for guiding signals from both source
and target during machine translation [Men+15]. Using CNNs with gating provides
guidance on which parts of the source text have influence on the target words. Fusing
them with the entire source sentence for context yields a better joint model.

Gehring et al. use a CNN with an attention module, thus showing not only a 
fast performing and parallelizable implementation, but also a more effective model 
compared to LSTM-based ones. Using word and positional embeddings in the input 
representation, stacking the filters on top of each other for hierarchical mappings, 
gated linear units, and multi-step attention module gives the researchers a real edge 
on English-French and English-German translation [Geh+17b]. 


6.7.6 Summarizations 


Denil et al. show that a hierarchy in which word embeddings compose
sentence embeddings, which in turn compose document embeddings, combined with
dynamic CNNs, gives useful document summarizations as well as effective
visualizations [Den+14]. The research also highlights that the composition can capture
everything from low-level lexical features to high-level semantic concepts in various
tasks, including summarization, classification, and visualization.

Cheng and Lapata develop a neural framework combining a CNN for hierarchical
document encoding with an attention-based extractor for effective document
summarization [CL16]. The representations are mapped very closely to the actual data,
composing words into sentences, sentences into paragraphs, and paragraphs into the
entire document using CNNs and max pooling, which gives the researchers a clear
advantage in capturing both local and global sentential information.


6.7.7 Question and Answers 


Dong et al. use multi-column CNNs for analyzing questions from various aspects,
such as answer path, answer type, and answer context [Don+15b]. Answers are
embedded, using their entities and relations, in a low-dimensional embedding
space, and a scoring layer on top is used to rank candidate answers. The
work shows that without hand-coded or engineered features, the design provides a
very effective question answering system.

Severyn and Moschitti show that using relational information given by the
matches between the words used in the question and answers with a CNN-based
framework gives a very effective question-answering system [SM15]. When a question
can be mapped to finding facts in a database, Yin et al. show in their work
that a two-stage approach using a CNN-based framework can yield excellent
results [Yin+16b]. The facts in the answers are mapped to subject, predicate, and
object. The first stage of the pipeline links the entity mention in the question to
the subject, employing a character-level CNN. The second stage matches the predicate
in the fact with the question using a word-level CNN with attentive max pooling.


6.8 Fast Algorithms for Convolutions 


CNNs, in general, are more parallel in nature than other deep learning
architectures. However, as training data sizes have increased, the need for faster,
near-real-time predictions has grown, and GPU-based hardware for parallelizing
operations has become more widespread, convolution operations in CNNs have gone
through many enhancements. In this section, we will discuss some fast algorithms for
CNNs and give insights into how convolutions can be made faster with fewer floating
point operations [LG16].


6.8.1 Convolution Theorem and Fast Fourier Transform 


The convolution theorem states that convolution in the time domain (for any discrete
input such as image or text) is equivalent to pointwise multiplication in the
frequency domain. We can represent this transformation as taking the fast Fourier
transform (FFT) of the input and of the kernel, multiplying the two pointwise, and
taking the inverse FFT:


$$(f * g)(t) = \mathcal{F}^{-1}\big(\mathcal{F}(f) \cdot \mathcal{F}(g)\big) \tag{6.42}$$


The convolution operation is an $O(n^2)$ algorithm, whereas the FFT has been shown
to be $O(n \log n)$. Thus, for a larger sequence, even with the two additional
operations of FFT and inverse FFT, it can be shown that $n + 2n\log(n) < n^2$, thus
giving a significant speedup.
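The theorem can be checked numerically with NumPy's FFT routines. The sketch below is illustrative: both signals are zero-padded to length len(f) + len(g) - 1 so that the circular convolution computed by the FFT coincides with the ordinary linear convolution, and the function name `fft_convolve` is hypothetical.

```python
import numpy as np


def fft_convolve(f, g):
    """Linear convolution of two real 1d signals via the convolution theorem."""
    n = len(f) + len(g) - 1          # length of the full linear convolution
    spectrum = np.fft.rfft(f, n) * np.fft.rfft(g, n)   # pointwise product
    return np.fft.irfft(spectrum, n)                   # back to the time domain


f = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 0.25])

print(fft_convolve(f, g))
print(np.convolve(f, g))             # direct O(n^2) computation for comparison
```

For such short signals the direct computation is faster in practice; the FFT route pays off as the sequences grow.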


6.8.2 Fast Filtering Algorithm 


Winograd algorithms use computational tricks to reduce the number of multiplications
in convolution operations. For example, if 1d input data of size n is convolved
with a filter of size r to give an output of size m, a direct computation takes
m x r multiplications. The minimal filtering algorithm F(m, r) can be shown to need
only $\mu(F(m, r)) = m + r - 1$ multiplications. Let us consider a simple example with
input $[d_0, d_1, d_2, d_3]$ of size 4 convolved with filter $[g_0, g_1, g_2]$ of size 3
to give an output of size 2. In traditional convolution, we would need 6
multiplications, but by arranging the inputs as shown:


$$F(2, 3) = \begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} g_0 \\ g_1 \\ g_2 \end{bmatrix} = \begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix} \tag{6.43}$$

where

$$m_1 = (d_0 - d_2)\, g_0 \tag{6.44}$$
$$m_2 = (d_1 + d_2)\, \frac{g_0 + g_1 + g_2}{2} \tag{6.45}$$
$$m_3 = (d_2 - d_1)\, \frac{g_0 - g_1 + g_2}{2} \tag{6.46}$$
$$m_4 = (d_1 - d_3)\, g_2 \tag{6.47}$$


Thus, the multiplications are reduced to only m + r - 1, i.e., 4, a saving by a factor
of 6/4 = 1.5. The terms $(g_0 + g_1 + g_2)/2$ and $(g_0 - g_1 + g_2)/2$ can be
precomputed from the filters, giving additional performance benefits. These fast
filtering algorithms can be written in matrix form as:


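$$Y = A^T\big[(G g) \odot (B^T d)\big], \tag{6.48}$$


where $\odot$ denotes element-wise multiplication. For the 1d example, the matrices are


$$B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix}, \qquad G = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\ 0 & 0 & 1 \end{bmatrix} \tag{6.49}$$

$$A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix} \tag{6.50}$$

$$g = \begin{bmatrix} g_0 & g_1 & g_2 \end{bmatrix}^T \tag{6.51}$$

$$d = \begin{bmatrix} d_0 & d_1 & d_2 & d_3 \end{bmatrix}^T \tag{6.52}$$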
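The matrix form of F(2, 3) can be verified numerically. The following NumPy sketch uses the standard transform matrices of the minimal filtering algorithm and compares the result with the direct sliding dot product; the variable names and the example numbers are illustrative.

```python
import numpy as np

# Transform matrices for F(2, 3)
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)


def winograd_f23(d, g):
    """Two outputs of a size-3 filter over a size-4 input tile.

    Only the 4 element-wise products count as multiplications here;
    G @ g can be precomputed once per filter.
    """
    return AT @ ((G @ g) * (BT @ d))


d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])

direct = np.array([d[0:3] @ g, d[1:4] @ g])   # 6 multiplications
print(winograd_f23(d, g), direct)
```

Both computations give the same two outputs; the Winograd form trades two multiplications for a few extra additions on the input tile.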


The 2d minimal algorithms can be expressed in terms of nested 1d algorithms.
For example, F(m, r) and F(n, s) can be used to compute m x n outputs for a filter
of size r x s. Thus

$$\mu(F(m \times n, r \times s)) = \mu(F(m, r))\,\mu(F(n, s)) = (m + r - 1)(n + s - 1) \tag{6.53}$$

Similarly, the matrix form for a 2d algorithm can be written as:

$$Y = A^T\big[(G g G^T) \odot (B^T d B)\big] A \tag{6.54}$$


A filter g is now of size r x r and each input can be considered to be a tile of
dimension (m + r - 1) x (m + r - 1). Generalization to a non-square matrix can be
done by nesting F(m, r) and F(n, s) as above. Thus, for F(2 x 2, 3 x 3), a normal
convolution will use 4 x 9 = 36 multiplications, whereas the fast filtering algorithm
needs only (2 + 3 - 1) x (2 + 3 - 1) = 16, giving a saving by a factor of
36/16 = 2.25.
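The nested 2d form for F(2 x 2, 3 x 3) can be checked in the same way. The sketch below reuses the 1d transform matrices of F(2, 3) and compares one 4 x 4 input tile against the direct computation; the random test data and function names are illustrative.

```python
import numpy as np

BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)


def winograd_f2x2_3x3(d, g):
    """2 x 2 outputs of a 3 x 3 filter over a 4 x 4 tile.

    The element-wise product of two 4 x 4 matrices costs 16 multiplications.
    """
    U = G @ g @ G.T        # transformed filter, 4 x 4
    V = BT @ d @ BT.T      # transformed input tile, 4 x 4
    return AT @ (U * V) @ AT.T


def direct_2x2(d, g):
    """Direct sliding-window computation: 4 outputs x 9 multiplications each."""
    return np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                     for i in range(2)])


rng = np.random.default_rng(0)
tile = rng.standard_normal((4, 4))
filt = rng.standard_normal((3, 3))
print(np.allclose(winograd_f2x2_3x3(tile, filt), direct_2x2(tile, filt)))
```

For larger inputs, the image is cut into overlapping 4 x 4 tiles and the same transform is applied to each tile.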


Here are some practical tips regarding CNNs, especially for classification
tasks.


• For classification tasks, a good starting point is the CNN with word
representations proposed by Kim [Kim14b].

• Pre-trained embeddings from word2vec or GloVe, used as a single static
channel for mapping sentences, should be tried in preference to one-hot
vector representations before fine-tuning or introducing multiple channels.

• Having multiple filter sizes such as [3, 4, 5], a number of feature maps ranging
from 60 to 500, and ReLU as the activation function often gives good
performance [ZW17].

• 1-max pooling, as compared to average pooling and k-max pooling, gives
better results [ZW17].

• The choice of regularization technique, i.e., L1, L2, dropout, etc., depends
on the dataset. It is always good to train both without and with
regularization and compare the validation metrics.

• Learning curves and the variances in them across multiple cross-validation
runs give an interesting idea of the "robustness" of the algorithm. The flatter
the curves and the smaller the variances, the more likely it is that the
validation metric estimates are accurate.

• Examine the predictions from the model on the validation data to look
for patterns in the false positives and false negatives. Is it because of
spelling mistakes? Is it because of dependencies and word order? Is it because
of a lack of training data to cover the cases?

• Character-based embeddings can be useful provided there is enough training
data.

• Other structures, such as LSTM, hierarchical, or attention-based components,
should be added incrementally to see the impact of each combination.


6.9 Case Study


To get hands-on experience with a real-world data analysis problem that involves
many of the techniques and frameworks described in this chapter, we will use
sentiment classification from text. In particular, we will use the public U.S. airline
sentiment dataset, scraped from Twitter, for classifying tweets as positive, negative,
or neutral. Negative tweets can be further classified by their reason.

We will evaluate the effectiveness of different deep learning techniques involving
CNNs with various input representations for sentiment classification. In this case
study, the classification will be based on the text of the tweet only, and not on any
tweet metadata. We will explore various representations of text data, such as word
embeddings trained from the data, pre-trained word embeddings, and character
embeddings. We have deliberately done little hyperparameter optimization for each
method, so the results show what each can produce without further fine-tuning.
Readers are welcome to use the notebook and code to explore fine-tuning themselves.


6.9.1 Software Tools and Libraries 


First, we need to describe the main open source tools and libraries we will use for 
our case study. 


• Keras (www.keras.io) is a high-level deep learning API written in Python which
gives a common interface to various deep learning backends, such as TensorFlow,
CNTK, and Theano. The code can run seamlessly on CPUs and GPUs. All
experiments with CNNs are done using the Keras API.

• TensorFlow (https://www.tensorflow.org/) is a popular open source machine
learning and deep learning library. We use TensorFlow as our deep learning
library, with the Keras API as the basic API for experimenting.

• Pandas (https://pandas.pydata.org/) is a popular open source library for
data structures and data analysis. We will use it for data exploration and some
basic processing.

• scikit-learn (http://scikit-learn.org/) is a popular open source library for various
machine learning algorithms and evaluations. We will use it only for sampling and
creating datasets for estimations in our case study.

• Matplotlib (https://matplotlib.org/) is a popular open source library for
visualization. We will use it to visualize performance.


Now we are ready to focus on the following four sub-tasks.


1. Exploratory data analysis

2. Data preprocessing and data splits

3. CNN model experiments and analysis

4. Understanding and improving models


6.9.2 Exploratory Data Analysis


The dataset has 14,640 labeled instances and 15 features/attributes, of which only
the attribute text will be used for learning. The classes fall into three categories:
positive, negative, and neutral. We will hold out 15% of the whole dataset for testing
in a stratified way, and similarly 10% of the training data for validation. Normally
cross-validation (CV) is used for both model selection and parameter tuning,
but we will use a separate validation set to reduce the running time. We did compare
CV estimates with those from the separate validation set, and the two looked comparable.

Figure 6.23 shows the class distribution. We see that there is a skew in class 
distribution towards negative sentiments as compared to positive and neutral. 


Fig. 6.23: Number of instances across different classes


One interesting step in EDA is to plot the word cloud for positive and negative
sentiment data from the entire dataset to understand some of the most frequent words
used, which may correlate with that sentiment. Tokens that are over-represented
in the cloud for the positive tweets are mostly positive words such as "thanks,"
"great," "good," "appreciate," etc., while the negative sentiment word cloud shows
reasons such as "luggage," "canceled flight," "website," etc., as shown in Fig. 6.24.




6.9.3 Data Preprocessing and Data Splits 


We perform some further basic data processing to remove stop words and mentions 
from the text as they are basically not useful in our classification task. The listing 
below shows the basic data cleanup code. 





    import re
    from nltk.corpus import stopwords

    # remove stop words with exceptions
    def remove_stopwords(input_text):
        stopwords_list = stopwords.words('english')
        # Some words which might indicate a certain sentiment are kept
        whitelist = ["n't", "not", "no"]
        words = input_text.split()
        clean_words = [word for word in words if (
            word not in stopwords_list or word in whitelist) and len(word) > 1]
        return " ".join(clean_words)

    # remove mentions
    def remove_mentions(input_text):
        return re.sub(r'@\w+', '', input_text)

    # keep only the text and label columns, then clean the text
    tweets = tweets[[TEXT_COLUMN_NAME, LABEL_COLUMN_NAME]]
    tweets[TEXT_COLUMN_NAME] = tweets[TEXT_COLUMN_NAME].apply(
        remove_stopwords).apply(remove_mentions)
    tweets.head()


Fig. 6.24: Word Cloud for negative sentiments








Next, we will split the entire data into training and testing sets, with 85% for
training and 15% for testing. We will build various models using the same training
data and evaluate them on the same test data to get a clear comparison.






    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        tweets[TEXT_COLUMN_NAME], tweets[LABEL_COLUMN_NAME],
        test_size=0.15, random_state=37)


We then tokenize the text, map each tweet to a fixed-size sequence, and split the
training data into training and validation sets. We use the maximum text length in
our corpus to determine the sequence length and pad shorter sequences.


    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical
    from sklearn.preprocessing import LabelEncoder

    # tokenization with max words defined and filters to remove characters
    tk = Tokenizer(num_words=NB_WORDS,
                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                   lower=True,
                   split=" ")
    tk.fit_on_texts(X_train)

    # understand the sequence distribution
    seq_lengths = X_train.apply(lambda x: len(x.split(' ')))

    # convert train and test to sequences using the tokenizer
    # trained on the training data
    X_train_total = tk.texts_to_sequences(X_train)
    X_test_total = tk.texts_to_sequences(X_test)

    # pad the sequences to a maximum length
    X_train_seq = pad_sequences(X_train_total, maxlen=MAX_LEN)
    X_test_seq = pad_sequences(X_test_total, maxlen=MAX_LEN)

    # perform encoding of the class labels
    le = LabelEncoder()
    y_train_le = le.fit_transform(y_train)
    y_test_le = le.transform(y_test)
    y_train_one_hot = to_categorical(y_train_le)
    y_test_one_hot = to_categorical(y_test_le)

    # hold out 10% of the training data for validation (assumed step: it is
    # described in the text but not shown in the original listing)
    X_train_seq, X_valid_seq, y_train, y_valid = train_test_split(
        X_train_seq, y_train_one_hot, test_size=0.1, random_state=37)


6.9.4 CNN Model Experiments 


Once we have preprocessed the data and created training, validation, and test sets,
we will perform various modeling analyses on the data. We will first do some basic
analysis using a simple CNN and then proceed to run various configurations and
modifications of the CNNs discussed in the chapter.

Next, we show the code for a basic CNN. The input layer takes sentences with a
maximum length of 24, and an embedding layer maps each token to a 100-dimensional
vector. These vectors are convolved with 64 filters, each of height or size 3
(equivalent to 3-grams), followed by a ReLU nonlinear activation function. A max
pooling layer follows, whose output is flattened and fed to a fully connected layer,
which in turn feeds a softmax layer with 3 outputs for the 3 classes.


    from keras.models import Sequential
    from keras.layers import (Embedding, Convolution1D, MaxPooling1D,
                              Flatten, Dense)

    # basic CNN model to understand how it works
    def base_cnn_model():
        # Embedding layer -> Convolution1D -> MaxPooling1D -> Flatten ->
        # fully connected -> classifier
        model = Sequential(
            [Embedding(input_dim=10000, output_dim=100, input_length=24),
             Convolution1D(filters=64, kernel_size=3, padding='same',
                           activation='relu'),
             MaxPooling1D(),
             Flatten(),
             Dense(100, activation='relu'),
             Dense(3, activation='softmax')])
        model.summary()
        return model

    # train and validate; train_model is a helper defined elsewhere in the
    # case-study code. The instance gets its own name so that it does not
    # shadow the function base_cnn_model.
    base_model = base_cnn_model()
    base_history = train_model(
        base_model,
        X_train_seq,
        y_train,
        X_valid_seq,
        y_valid)

A single-channel version of Yoon Kim's CNN model, using pre-trained embeddings
and multiple filters, is shown here:


    from keras.models import Model
    from keras.layers import (Input, Embedding, Convolution1D, MaxPooling1D,
                              GlobalMaxPooling1D, Dense, concatenate)

    # create embedding matrix for the experiment
    # (create_embedding_matrix is a helper defined elsewhere in the
    # case-study code)
    emb_matrix = create_embedding_matrix(tk, 100, embeddings)

    # single channel CNN with multiple filters
    def single_channel_kim_cnn():
        text_seq_input = Input(shape=(MAX_LEN,), dtype='int32')
        text_embedding = Embedding(NB_WORDS + 1,
                                   EMBEDDING_DIM,
                                   weights=[emb_matrix],
                                   trainable=True,
                                   input_length=MAX_LEN)(text_seq_input)
        filter_sizes = [3, 4, 5]
        convs = []
        # parallel layers for each filter size with conv1d and max pooling
        for filter_size in filter_sizes:
            l_conv = Convolution1D(
                filters=128,
                kernel_size=filter_size,
                padding='same',
                activation='relu')(text_embedding)
            l_pool = MaxPooling1D(filter_size)(l_conv)
            convs.append(l_pool)
        # concatenate outputs from all cnn blocks
        merge = concatenate(convs, axis=1)
        convol = Convolution1D(128, 5, activation='relu')(merge)
        pool1 = GlobalMaxPooling1D()(convol)
        dense = Dense(128, activation='relu', name='Dense')(pool1)
        # classification layer
        out = Dense(3, activation='softmax')(dense)
        model = Model(
            inputs=[text_seq_input],
            outputs=out,
            name="KimSingleChannelCNN")
        model.summary()
        return model

    single_channel_kim_model = single_channel_kim_cnn()
    single_channel_kim_model_history = train_model(
        single_channel_kim_model, X_train_seq, y_train,
        X_valid_seq, y_valid)





We list each experiment by name and purpose before highlighting the results
from each.


1. Base CNN. A basic single block of CNN with convolution with filter of size 3, 
max pooling and a softmax layer. 

2. Base CNN + Dropout. To see the impact of dropout on the base CNN. 

3. Base CNN + Regularization. To see the impact of L2 regularization on the 
base CNN. 

4. Multi-filters. To see the impact of adding more filters [2,3,4,5] to CNN. 

5. Multi-filters + Increased Maps. To see the impact of increasing the filter maps 
from 64 to 128. 

6. Multi-filters + Static Pre-trained Embeddings. To see the impact of using 
pre-trained word embeddings in CNN. 

7. Multi-filters + Dynamic Pre-trained Embeddings. To see the impact of using 
pre-trained word embeddings in CNN that are trained on the training set. 

8. Yoon Kim’s Single Channel. Single channel CNN using widely known archi- 
tecture [Kim14b]. 

9. Yoon Kim’s Multiple Channel. Multiple channel CNN using widely known 
architecture [Kim14b] to see the impact on increasing the channels. Static and 
dynamic embeddings are used as two different channels. 

10. Kalchbrenner et al. Dynamic CNN. Kalchbrenner et al. based dynamic CNN
with K-max pooling [KGB14b].

11. Multichannel Variable MVCNN. We use two embedding layers with static 
and dynamic channels [YS16]. 

12. Multigroup MG-CNN. We use three different channels (two inputs with
embedding layers using GloVe and one using fastText) with different dimensions
(100 and 300) [ZRW16].

13. Word-level Dilated CNN. Exploring the concept of dilations with the reduced 
parameters and larger coverage using word-level inputs [YK15]. 

14. Character-level CNN. We explore the character-level embeddings instead of 
the word-level embeddings [ZZL15]. 

15. Very Deep Character-level CNN. Impact of a very deep level CNN with mul- 
tiple layers [Con+16]. 

16. Character-level Dilated CNN. Exploring the concept of dilations with the re- 
duced parameters and larger coverage using character-level inputs [YK15]. 

17. C-LSTM. Exploring C-LSTM to verify how CNN can be used to capture the 
local features of phrases and RNN to capture global and temporal sentence se- 
mantics [Zho+15]. 

18. AC-BiLSTM. Exploring the bi-directional LSTM with CNN [LZ16].

We use some practical deep learning techniques while training the models, as
highlighted below:


# use the validation loss to detect the best weights to be saved
checkpoints.append(ModelCheckpoint(checkpoint_file, monitor='val_loss',
                                   verbose=0, save_best_only=True,
                                   save_weights_only=True, mode='auto',
                                   period=1))
# output to TensorBoard
checkpoints.append(TensorBoard(log_dir='./logs', write_graph=True,
                               write_images=False))
# if no improvements in 10 epochs, then quit
checkpoints.append(EarlyStopping(monitor='val_loss', patience=10))
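To make the combined effect of the checkpoint and early-stopping callbacks concrete, the following is a minimal sketch in plain Python of the logic they implement together: remember the epoch with the lowest validation loss and stop once no improvement has been seen for `patience` epochs. The function name `train_with_early_stopping` and the loss values are hypothetical and used only for illustration; this is not the Keras implementation.

```python
def train_with_early_stopping(val_losses, patience=10):
    """Scan per-epoch validation losses; return (best_epoch, stop_epoch).

    best_epoch is the epoch whose weights a best-only checkpoint would keep;
    stop_epoch is the last epoch that would be run.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where the weights would be saved.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses for seven epochs.
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.66]
best_epoch, stop_epoch = train_with_early_stopping(losses, patience=3)
print(best_epoch, stop_epoch)
```

With a patience of 3, training stops three epochs after the best validation loss, and the saved weights are those of the best epoch rather than the last one.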


Table 6.1 summarizes the results of running the different CNN architectures
listed above, tracked by accuracy and average precision.


Table 6.1: CNN test results summary 


Experiments                                       Accuracy %    Average precision %
Base CNN                                          [illegible]   82
Base CNN + dropout                                70.85         78
Base CNN + regularization                         78.32         83
Multi-filters                                     80.55         86
Multi-filters + increased maps                    79.18         85
Multi-filters + static pre-trained embeddings     77.41         84
Multi-filters + dynamic pre-trained embeddings    78.96         85
Yoon Kim's Single Channel                         79.50         85
Yoon Kim's Multiple Channel                       80.05         86
Kalchbrenner et al. dynamic CNN                   78.68         85
Multichannel variable MVCNN                       79.91         85
Multigroup CNN MG-CNN                             81.96         87
Word-level dilated CNN                            77.81         84
Character-level CNN                               73.36         81
Very deep character-level CNN                     67.89         73
Character-level dilated CNN                       74.18         78
C-LSTM                                            79.14         85
AC-BiLSTM                                         79.46         86


Bold indicates best result or accuracy amongst all the experiments 


We list some high-level observations from Table 6.1 and from our analysis of
the various runs below:


• Basic CNN with L2 regularization seems to reduce overfitting, both by
lowering the loss and by cutting the maximum loss. Dropout seems to hurt
the performance of the basic CNN.




• Multiple layers and multiple filters seem to improve both accuracy and
average precision by more than 2%.

• Using pre-trained embeddings that are further trained on the data gives one
of the best performances, which is consistent with much prior research.

• Multigroup norm constraint MG-CNN shows the best results in both accuracy
and average precision among the word-based representations. Using three
embedding channels with two different embeddings of different sizes
seems to give it the edge.

• Yoon Kim's model with two channels has the second-best performance,
confirming that it should always be a model to try in classification problems.
The performance of the dual-channel model, along with MG-CNN, suggests that
increasing the number of channels helps the model in general.

• Increasing the depth and complexity of the CNN, and hence the number of
parameters, has little effect on generalization, which again can be attributed
to the small training data size.

• Character-based representation shows relatively poor performance, which is
consistent with most research and can be attributed to the limited corpus and
training size.

• Introducing complexity by combining CNN and LSTM does not improve
the performance, which again can be attributed to the task complexity and the
size of the training data.


6.9.5 Understanding and Improving the Models 


In this section, we give some practical tips and tricks to help researchers
gain insight into model behavior and improve the models further.

One way of understanding model behavior is to look at the outputs of various
layers using dimensionality reduction and visualization techniques. To explore
the behavior of the penultimate layer (the layer just before the classification
layer), we first create a clone of the model with the last layer removed and use
the test set to generate high-dimensional outputs from this layer. We then use PCA
to obtain 30 components from the 128-dimensional outputs and finally project these
using t-SNE. As shown in Fig. 6.25, we see a clear reason why Yoon Kim's Single
Channel CNN performs better than the Basic CNN.

Next, we analyze the false positives and false negatives to understand their
patterns and causes, so that the models can be improved further.


Fig. 6.25: A layer before last with 128 dimensions is used to visualize the test data
with PCA and t-SNE. (a) Hidden layer from Basic CNN. (b) Hidden layer from
Yoon Kim's Single Channel CNN


Table 6.2 highlights some of the text and the probable cause. Words such
as late flight, overhead, etc. are so overrepresented in the negatives that
even positive sentences containing them are classified as negative. Adding
more positives with similar language and using average pooling might help.
Adding support for emojis, and even using embeddings that have been trained on
them, can further improve results on examples that use them.
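The suggestion to try average pooling can be illustrated with a small sketch. The activation values below are hypothetical outputs of a single filter that responds to a negative keyword; they are not taken from the trained models. Max pooling keeps only the strongest response, so one overrepresented keyword makes a mostly positive sentence look identical to a truly negative one, whereas average pooling takes the rest of the sentence into account.

```python
def max_pool(activations):
    """Global max pooling: keep only the strongest response."""
    return max(activations)


def average_pool(activations):
    """Global average pooling: mean response over all positions."""
    return sum(activations) / len(activations)


# Hypothetical per-word activations of one "negative keyword" filter.
mostly_positive = [0.1, 0.0, 0.2, 0.9, 0.1, 0.0]  # single spike on the keyword
truly_negative = [0.8, 0.7, 0.9, 0.9, 0.6, 0.8]   # strong response throughout

print(max_pool(mostly_positive), max_pool(truly_negative))
print(average_pool(mostly_positive), average_pool(truly_negative))
```

Under max pooling both sentences produce the same feature value, while average pooling separates them; whether this helps on the actual dataset would need to be verified experimentally.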


Table 6.2: False negatives

Text                                                      Predicted   Probable cause
Kudos ticket agents making passengers check bags big      Negative    Keyword overrepresented
fit overhead
Thankful united ground staff put last seat last flight    Negative    Keyword overrepresented
out home late flight still home
Emoji love flying





Table 6.3 highlights some of the text and probable causes. Words such
as awesome, thanks, etc. are so overrepresented in the positives that
even negative sentences containing them are classified as positive. Adding more
negatives with similar language and using average pooling might help. Using
sarcasm-based datasets, training embeddings on them, and using these as an input
channel can improve the performance.

Using Lime for model explanations, especially for false positives and false
negatives, gives insight into the reasons behind a prediction through the weights
associated with keywords, as shown in Fig. 6.26.


Table 6.3: False positives

Text                                                      Predicted   Probable cause
Forget reservations thank great company i've flighted     Positive    Keyword overrepresented and sarcasm
flight once again thank you
Thanks finally made it and missed meetings now            Positive    Keyword overrepresented and sarcasm
My flight cancelled led mess please thank awesome out     Positive    Keyword overrepresented


def keras_wrapper(texts):
    _seq = tk.texts_to_sequences(texts)
    _text_data = pad_sequences(_seq, maxlen=MAX_LEN)
    return single_channel_kim_model.predict(_text_data)

exp = explainer.explain_instance(
    'forget reservations thank great company i have cancelled flighted '
    'flight once again thank you',
    keras_wrapper,
    num_features=10,
    labels=[0, 2])


[Figure: horizontal bar chart of Lime feature weights for the Negative and Positive classes. Features: flight, again, forget, company, flighted, reservations, cancelled, great, thank]

Fig. 6.26: Lime outputs weights for words for both positive and negative class


6.9.6 Exercises for Readers and Practitioners 


Some other interesting problems and research questions that readers and
practitioners can attempt are listed below:

1. What is the measurable impact of preprocessing, such as removing stop words
and mentions, stemming, and so on, on the CNN performance?

2. Does the embedding dimension have an impact on the CNN; for example, a 
100-dimensional embedding vs. a 300-dimensional embedding? 

3. Does the type of embeddings such as word2vec, GloVe, and other word
embeddings change the performance significantly across CNN frameworks?

4. Is there an impact in performance when multiple embeddings such as word, 
POS Tags, positional are used for sentence representation with CNN frame- 
works? 

5. Does hyperparameter tuning done more robustly across various parameters
improve validation and hence the test results significantly?

6. Many researchers use the ensemble of models, with different parameters as well 
as with different model types. Does that improve performance? 

7. Do pre-trained character embeddings improve performance as compared to one 
that we tune on the limited training data? 

8. Using some of the standard CNN frameworks such as AlexNet, VGG-16, and 
others with modifications for text processing and doing a survey of these on the 
dataset seems like further interesting research. 


6.10 Discussion 


Geoffrey Hinton, in his talk “What is wrong with convolutional neural nets?” given
at MIT, highlights some of the issues with CNNs, especially around max pooling.
The talk explains how max pooling can “ignore” some important signals because of
its bias towards finding the “key” features. Another issue highlighted is that
filters can capture different features and build higher-level features, but to
some extent fail to capture the “relationship” between these features. For
example, in face recognition, a CNN whose filters and layers detect eyes, ears,
and a mouth cannot distinguish a real face from an image in which these features
are present but not in the right place. Capsule Networks, which use capsules as a
basic building block, are seen as an alternative design to overcome these issues
[SFH17].

Chapter 7
Recurrent Neural Networks


7.1 Introduction 


In the previous chapter, CNNs provided a way for neural networks to learn a hierar- 
chy of weights, resembling that of n-gram classification on the text. This approach 
proved to be very effective for sentiment analysis, or more broadly text
classification. One of the disadvantages of CNNs, however, is their inability to
model contextual information over long sequences.¹ In many situations in NLP, it is desirable to
capture long-term dependencies and maintain the contextual order between words to 
resolve the overall meaning of a text. In this chapter, we introduce recurrent neural 
networks (RNNs) that extend deep learning to sequences. 

Modeling sequential information and long-term dependencies in NLP has
traditionally relied on HMMs to compute context information, for example, in
dependency parsing. One of the limitations of using a Markov chain for
sequence-focused tasks is that the generation of each prediction is limited to a
fixed number of previous states. RNNs, however, relax this constraint,
accumulating information from each time step into a “hidden state.” This allows
sequential information to be “summarized,” and predictions can be made based on
the entire history of the sequence.

Another advantage of RNNs is their ability to learn representations for variable 
length sequences, such as sentences, documents, and speech samples. This allows 
two samples of differing lengths to be mapped into the same feature space, allowing 
them to be comparable. In the context of language translation, for example, an input 
sentence may have more words than its translation, requiring a variable number of 
computational steps. Thus, it is highly beneficial to have knowledge of the entire 
length of the sentence before predicting the translation. We will study this example 
more at the end of this chapter. 

In this chapter, we begin by describing the basic building blocks of RNNs and
how they retain memory. We then describe the training process for RNNs and
discuss the vanishing gradient problem, regularization, and RNN variants. Next we
show how to incorporate text input in recurrent architectures, leveraging word and
character representations. We then introduce some traditional RNN architectures in
NLP, and then move towards more modern architectures. The chapter is concluded
with a case study on neural machine translation and a discussion about the future
directions of RNNs.

¹ This statement is made in a basic context of CNNs and RNNs. The CNN vs. RNN superiority
debate in sequential contexts is an active area of research.


7.2 Basic Building Blocks of RNNs 


An RNN is a standard feed-forward neural network applied to vector inputs in a 
sequence. However, in order to incorporate sequential context into the next time 
step’s prediction, a “memory” of the previous time steps in the sequence must be 
preserved. 


7.2.1 Recurrence and Memory 


First we will look at the idea of recurrence conceptually. Let us define a $T$-length
input sequence as $X$, where $X = \{x_1, x_2, \ldots, x_T\}$, such that $x_t \in \mathbb{R}^N$ is a vector input
at time $t$. We then define our memory or history up to and including time $t$ as $h_t$.²
Thus, we can define our output $o_t$ as:


$$o_t = f(x_t, h_{t-1}) \tag{7.1}$$


where the function $f$ maps memory and input to an output. The memory from the
previous time step is $h_{t-1}$ and the input is $x_t$. For the initial case $x_1$, $h_0$ is the zero
vector $\mathbf{0}$.

Abstractly, the output $o_t$ is considered to have summarized the information from
the current input $x_t$ and the previous history from $h_{t-1}$. Therefore, $o_t$ can be
considered the history vector for the entire sequence up to and including time $t$. This
yields the equation:


$$h_t = o_t = f(x_t, h_{t-1}) \tag{7.2}$$


Here we see where the term “recurrence” comes from: the application of the same
function for each instance, wherein the output is directly dependent on the previous
result.
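The recurrence in Eq. (7.2) can be sketched without any neural network machinery. In the minimal example below, `run_recurrence` and the toy update function `ema` (an exponential moving average) are hypothetical names chosen for illustration; any function of the current input and the previous history could take the place of `ema`.

```python
def run_recurrence(f, inputs, h0):
    """Apply h_t = f(x_t, h_{t-1}) over a sequence and return every h_t."""
    history = []
    h = h0
    for x in inputs:
        h = f(x, h)          # the same function is applied at every step
        history.append(h)
    return history


def ema(x, h_prev, alpha=0.5):
    """Toy update: blend the current input with the previous history."""
    return alpha * x + (1 - alpha) * h_prev


states = run_recurrence(ema, [1.0, 1.0, 0.0], h0=0.0)
print(states)
```

Each state depends on the whole prefix of the sequence through the previous state, which is the property an RNN exploits with a learned function in place of `ema`.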

More formally, we can extend this concept to neural networks by redefining the
transformation function $f$ as follows:


$$h_t = f(U x_t + W h_{t-1}) \tag{7.3}$$


where $W$ and $U$ are weight matrices, $W, U \in \mathbb{R}^{N \times N}$, and $f$ is a non-linear function,
such as tanh, $\sigma$, or ReLU. Figure 7.1 shows a diagram of the simple RNN that we
have described here.

Fig. 7.1: Diagram of a recurrent neural network

² This history vector will be called the hidden state later on for obvious reasons.


7.2.2 PyTorch Example 


The code snippet below illustrates a PyTorch implementation of the simple RNN 
previously described. It illustrates the recurrent computation in a modern frame- 
work. 


# PyTorch RNN definition
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class RNN(nn.Module):

    def __init__(self, input_size):
        super(RNN, self).__init__()

        self.input_size = input_size
        self.hidden_size = input_size
        self.output_size = input_size

        self.U = nn.Linear(input_size, self.hidden_size)
        self.W = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden):
        Ux = self.U(input)
        Wh = self.W(hidden)
        output = Ux + Wh
        return output

rnn = RNN(input_size)

# Training the network
optimizer = optim.Adam(rnn.parameters(), lr=learning_rate,
                       weight_decay=1e-5)
for epoch in range(n_epoch):
    for data, target in train_loader:
        # Get samples (tensors no longer need to be wrapped in Variable)
        input = data

        # Forward propagation
        hidden = torch.zeros(1, rnn.hidden_size)
        for i in range(input.size()[0]):
            output = rnn(input[i], hidden)
            hidden = output

        # Error computation (nll_loss expects log-probabilities)
        loss = F.nll_loss(F.log_softmax(output, dim=-1), target)

        # Clear gradients
        optimizer.zero_grad()

        # Backpropagation
        loss.backward()

        # Parameter update
        optimizer.step()


In this snippet, an output is produced at every time step. Rather than computing
the error as each output is produced, the error is computed after forward
propagation has completed for all time steps, and it is then backpropagated
through each time step. This snippet by itself is incomplete, because our input
size, output size, and hidden size will normally differ depending on the problem,
as we will see in the upcoming sections.


7.3 RNNs and Properties 


Let us now focus on a typical implementation of an RNN, how it is trained, and 
some of the difficulties that are introduced in training them. 


7.3.1 Forward and Backpropagation in RNNs 


RNNs are trained through backpropagation and gradient descent similar to feed- 
forward networks we have seen previously: forward propagating an example, cal- 
culating the error for a prediction, computing the gradients for each set of weights 
via backpropagation, and updating the weights according to the gradient descent 
optimization method. 

The forward propagation equations for $h_t$ and the output prediction $\hat{y}_t$ are:


$$
\begin{aligned}
h_t &= \tanh(U x_t + W h_{t-1}) \\
\hat{y}_t &= \text{softmax}(V h_t)
\end{aligned}
\tag{7.4}
$$

where the learnable parameters are $U$, $W$, and $V$.³ $U$ incorporates the information
from $x_t$, $W$ incorporates the recurrent state, and $V$ learns a transformation to the
output size and classification. A diagram of this RNN is shown in Fig. 7.2.

Fig. 7.2: Forward propagation of a simple RNN


We compute the error using cross-entropy loss at each time step $t$, where $y_t$ is the
target.


$$E_t = -y_t \log \hat{y}_t \tag{7.5}$$


This gives us the overall loss with the following:

$$L(y, \hat{y}) = -\frac{1}{T} \sum_t y_t \log \hat{y}_t \tag{7.6}$$
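The forward equations (7.4) and the loss in (7.5) and (7.6) can be sketched in NumPy as follows. This is an illustrative sketch, not the book's implementation: the function names, the dimensions, the random initialization, and the choice to average the loss over time steps are all assumptions made for the example.

```python
import numpy as np


def softmax(q):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(q - np.max(q))
    return e / e.sum()


def rnn_forward(xs, U, W, V):
    """Run h_t = tanh(U x_t + W h_{t-1}) and y_hat_t = softmax(V h_t)."""
    h = np.zeros(W.shape[0])          # initial hidden state is the zero vector
    hs, y_hats = [], []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        hs.append(h)
        y_hats.append(softmax(V @ h))
    return np.array(hs), np.array(y_hats)


def sequence_loss(ys, y_hats):
    """Cross-entropy per time step, averaged over the sequence."""
    return -np.mean(np.sum(ys * np.log(y_hats), axis=1))


# Hypothetical sizes: 4-d inputs, 5-d hidden state, 3 classes, 6 time steps.
rng = np.random.default_rng(0)
d_in, d_h, n_cls, T = 4, 5, 3, 6
U = rng.normal(scale=0.1, size=(d_h, d_in))
W = rng.normal(scale=0.1, size=(d_h, d_h))
V = rng.normal(scale=0.1, size=(n_cls, d_h))
xs = rng.normal(size=(T, d_in))
ys = np.eye(n_cls)[rng.integers(0, n_cls, size=T)]   # one-hot targets

hs, y_hats = rnn_forward(xs, U, W, V)
loss = sequence_loss(ys, y_hats)
print(hs.shape, y_hats.shape, loss)
```

With small random weights the predictions are close to uniform, so the loss starts near log(3) for three classes.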


The gradients are computed by evaluating every path that contributed to the
prediction $\hat{y}_t$. This process is called backpropagation through time (BPTT) and
is illustrated in Fig. 7.3.

The parameters of our RNN are $U$, $V$, and $W$, so we must compute the gradient of
our loss function with respect to these matrices. Figure 7.4 shows backpropagation
through a step of the RNN.


³ It is common to split the single weight matrix W of an RNN in Eq. (7.3) into two separate weight
matrices, here U and W. Doing this allows for a lower computational cost and forces separation
between the hidden state and input in the early stages of training.

Fig. 7.3: Backpropagation through time shown with respect to the error at $t = 3$. The
error $E_3$ is comprised of input from each previous time step and the inputs to those
time steps. This figure excludes backpropagation with respect to $E_1$, $E_2$, and $E_4$

Fig. 7.4: Backpropagation through a single time step of a simple RNN


7.3.1.1 Output Weights (V) 


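The weight matrix $V$ controls the output dimensionality of $\hat{y}$, and does not contribute
to the recurrent connection. Therefore, computing the gradient is the same as a linear
layer.

For convenience, let

$$q_t = V h_t. \tag{7.7}$$

Then,

$$\frac{\partial E_t}{\partial V_{i,j}} = \frac{\partial E_t}{\partial \hat{y}_{t_k}} \frac{\partial \hat{y}_{t_k}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial V_{i,j}} \tag{7.8}$$

From our definition of $E_t$ (7.5), we have that:

$$\frac{\partial E_t}{\partial \hat{y}_{t_k}} = -\frac{y_{t_k}}{\hat{y}_{t_k}} \tag{7.9}$$

Backpropagation through the softmax function can be computed as:

$$
\frac{\partial \hat{y}_{t_k}}{\partial q_{t_l}} =
\begin{cases}
-\hat{y}_{t_k} \hat{y}_{t_l}, & k \neq l \\
\hat{y}_{t_k} (1 - \hat{y}_{t_k}), & k = l
\end{cases}
\tag{7.10}
$$

If we combine (7.9) and (7.10) we obtain the sum over all values of $k$ to produce
$\frac{\partial E_t}{\partial q_{t_l}}$:

$$-\frac{y_{t_l}}{\hat{y}_{t_l}} \hat{y}_{t_l} (1 - \hat{y}_{t_l}) + \sum_{k \neq l} \left(-\frac{y_{t_k}}{\hat{y}_{t_k}}\right) \left(-\hat{y}_{t_k} \hat{y}_{t_l}\right) = -y_{t_l} + y_{t_l} \hat{y}_{t_l} + \sum_{k \neq l} y_{t_k} \hat{y}_{t_l} \tag{7.11a}$$

$$= -y_{t_l} + \hat{y}_{t_l} \sum_k y_{t_k} \tag{7.11b}$$

Recall that all $y_t$ are one-hot vectors, meaning that all values in the vector are
zero except for one indicating the class. Thus, the sum is 1, so

$$\frac{\partial E_t}{\partial q_{t_l}} = \hat{y}_{t_l} - y_{t_l} \tag{7.12}$$

Lastly, $q_t = V h_t$, so $q_{t_l} = V_{l,m} h_{t_m}$. Therefore,

$$\frac{\partial q_{t_l}}{\partial V_{i,j}} = \frac{\partial}{\partial V_{i,j}} \left(V_{l,m} h_{t_m}\right) \tag{7.13a}$$

$$= \delta_{i,l} \delta_{j,m} h_{t_m} \tag{7.13b}$$

$$= \delta_{i,l} h_{t_j}. \tag{7.13c}$$

Now we combine (7.12) and (7.13c) to obtain:

$$\frac{\partial E_t}{\partial V_{i,j}} = (\hat{y}_{t_i} - y_{t_i}) h_{t_j} \tag{7.14}$$

which is recognizable as the outer product. Hence,

$$\frac{\partial E_t}{\partial V} = (\hat{y}_t - y_t) \otimes h_t \tag{7.15}$$

where $\otimes$ is the outer product.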
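The result in Eq. (7.15) can be checked numerically. The sketch below compares the analytic outer-product gradient against a central finite-difference estimate for a single time step. The sizes, the random values, and the helper names `softmax` and `loss_fn` are assumptions chosen for the example.

```python
import numpy as np


def softmax(q):
    """Numerically stable softmax of a 1-D vector."""
    e = np.exp(q - np.max(q))
    return e / e.sum()


def loss_fn(V, h, y):
    """Cross-entropy E_t = -sum_k y_k log y_hat_k with y_hat = softmax(V h)."""
    return -np.sum(y * np.log(softmax(V @ h)))


rng = np.random.default_rng(1)
h = rng.normal(size=4)              # hidden state h_t (hypothetical values)
V = rng.normal(size=(3, 4))         # output weights
y = np.array([0.0, 1.0, 0.0])       # one-hot target

# Analytic gradient from Eq. (7.15): (y_hat - y) outer h
analytic = np.outer(softmax(V @ h) - y, h)

# Central finite differences, one entry of V at a time
numeric = np.zeros_like(V)
eps = 1e-6
for i in range(V.shape[0]):
    for j in range(V.shape[1]):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[i, j] += eps
        V_minus[i, j] -= eps
        numeric[i, j] = (loss_fn(V_plus, h, y) - loss_fn(V_minus, h, y)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)
```

The two gradients agree up to finite-difference error, which is a useful sanity check whenever a gradient is derived by hand.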


7.3.1.2 Recurrent Weights (W) 


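The parameter $W$ appears in the argument for $h_t$, so we will have to check the
gradient in both $h_t$ and $\hat{y}_t$. We must also make note that $\hat{y}_t$ depends on $W$ both
directly and indirectly (through $h_{t-1}$). Let $z_t = U x_t + W h_{t-1}$. Then $h_t = \tanh(z_t)$.
At first it seems that by the chain rule we have:

$$\frac{\partial E_t}{\partial W_{i,j}} = \frac{\partial E_t}{\partial \hat{y}_{t_k}} \frac{\partial \hat{y}_{t_k}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial h_{t_m}} \frac{\partial h_{t_m}}{\partial W_{i,j}} \tag{7.16}$$

Note that of these four terms, we have already calculated the first two, and the third
is simple:

$$\frac{\partial q_{t_l}}{\partial h_{t_m}} = \frac{\partial}{\partial h_{t_m}} \left(V_{l,b} h_{t_b}\right) \tag{7.17a}$$

$$= V_{l,b} \delta_{b,m} \tag{7.17b}$$

$$= V_{l,m} \tag{7.17c}$$

The final term, however, requires an implicit dependence of $h_t$ on $W_{i,j}$ through $h_{t-1}$
as well as a direct dependence. Hence, we have:

$$\frac{\partial h_{t_m}}{\partial W_{i,j}} \rightarrow \frac{\partial h_{t_m}}{\partial W_{i,j}} + \frac{\partial h_{t_m}}{\partial h_{t-1_n}} \frac{\partial h_{t-1_n}}{\partial W_{i,j}} \tag{7.18}$$

But we can just apply this again to yield:

$$\frac{\partial h_{t_m}}{\partial W_{i,j}} \rightarrow \frac{\partial h_{t_m}}{\partial W_{i,j}} + \frac{\partial h_{t_m}}{\partial h_{t-1_n}} \frac{\partial h_{t-1_n}}{\partial W_{i,j}} + \frac{\partial h_{t_m}}{\partial h_{t-1_n}} \frac{\partial h_{t-1_n}}{\partial h_{t-2_p}} \frac{\partial h_{t-2_p}}{\partial W_{i,j}} \tag{7.19}$$

This process continues until we reach $h_{-1}$, which was initialized to a vector of
zeros ($\mathbf{0}$). Notice that the last term in (7.19) collapses to
$\frac{\partial h_{t_m}}{\partial h_{t-2_n}} \frac{\partial h_{t-2_n}}{\partial W_{i,j}}$ and we can
turn the first term into $\frac{\partial h_{t_m}}{\partial h_{t_n}} \frac{\partial h_{t_n}}{\partial W_{i,j}}$. Then, we arrive at the compact form:

$$\frac{\partial h_{t_m}}{\partial W_{i,j}} = \frac{\partial h_{t_m}}{\partial h_{r_n}} \frac{\partial h_{r_n}}{\partial W_{i,j}}, \tag{7.20}$$

where we sum over all values of $r$ from 0 up to $t$ in addition to the standard dummy
index $n$. More clearly, this is written as:

$$\frac{\partial h_{t_m}}{\partial W_{i,j}} = \sum_{r=0}^{t} \frac{\partial h_{t_m}}{\partial h_{r_n}} \frac{\partial h_{r_n}}{\partial W_{i,j}} \tag{7.21}$$

This term is responsible for the vanishing/exploding gradient problem: the gradient
exponentially shrinking to 0 (vanishing) or exponentially growing larger (exploding).
The multiplication of the term $\frac{\partial h_{t_m}}{\partial h_{r_n}}$ by the term $\frac{\partial h_{r_n}}{\partial W_{i,j}}$ means that the product
will be smaller if both terms are less than 1 or larger if the terms are greater than 1.
We will address this problem in more detail momentarily.

Combining all of these yields:

$$\frac{\partial E_t}{\partial W_{i,j}} = (\hat{y}_{t_l} - y_{t_l}) V_{l,m} \sum_{r=0}^{t} \frac{\partial h_{t_m}}{\partial h_{r_n}} \frac{\partial h_{r_n}}{\partial W_{i,j}} \tag{7.22}$$


7.3.1.3 Input Weights (U)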


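Taking the gradient of $U$ is similar to doing it for $W$ since they both require taking
sequential derivatives of the $h_t$ vector. We have:

$$\frac{\partial E_t}{\partial U_{i,j}} = \frac{\partial E_t}{\partial \hat{y}_{t_k}} \frac{\partial \hat{y}_{t_k}}{\partial q_{t_l}} \frac{\partial q_{t_l}}{\partial h_{t_m}} \frac{\partial h_{t_m}}{\partial U_{i,j}} \tag{7.23}$$

Note that we only need to calculate the last term now. Following the same procedure
as for $W$, we find that:

$$\frac{\partial h_{t_m}}{\partial U_{i,j}} = \sum_{r=0}^{t} \frac{\partial h_{t_m}}{\partial h_{r_n}} \frac{\partial h_{r_n}}{\partial U_{i,j}} \tag{7.24}$$

and thus we have:

$$\frac{\partial E_t}{\partial U_{i,j}} = (\hat{y}_{t_l} - y_{t_l}) V_{l,m} \sum_{r=0}^{t} \frac{\partial h_{t_m}}{\partial h_{r_n}} \frac{\partial h_{r_n}}{\partial U_{i,j}} \tag{7.25}$$

The difference between $U$ and $W$ appears in the actual implementation since the
values of $\frac{\partial h_{r_n}}{\partial U_{i,j}}$ and $\frac{\partial h_{r_n}}{\partial W_{i,j}}$ differ.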


7.3.1.4 Aggregate Gradient 


The error for all time steps is the summation of the $E_t$'s according to our loss
function (7.6). Therefore, we can sum the gradients for each of the weights in our
network ($U$, $V$, and $W$) and update them with the accumulated gradients.


7.3.2 Vanishing Gradient Problem and Regularization 


One of the most difficult parts of training RNNs is the vanishing/exploding gradient
problem (often referred to as just the vanishing gradient problem).⁴ During
backpropagation, the gradients are multiplied by the weight's contribution to the error
at each time step, as shown in Eq. (7.21). The impact of this multiplication at each
time step dramatically reduces or increases the gradient propagated to the previous
time step, which will in turn be multiplied again. The recurrent multiplication in the
backpropagation step causes an exponential effect for any irregularity.


• If the weights are small, the gradients will shrink exponentially.
• If the weights are large, the gradients will grow exponentially.
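The exponential effect described in the two cases above can be seen with a deliberately simplified model. The sketch below assumes a scalar, linear recurrence $h_t = w h_{t-1}$, so that every backpropagation step multiplies the gradient by the same factor $w$; the tanh derivative, which would shrink the gradient further, is ignored. The function name `gradient_scale` is hypothetical.

```python
def gradient_scale(w, steps):
    """Factor by which a gradient is scaled after `steps` multiplications by w."""
    scale = 1.0
    for _ in range(steps):
        scale *= w
    return scale


small = gradient_scale(0.5, 50)   # weight below 1: gradient vanishes
large = gradient_scale(1.5, 50)   # weight above 1: gradient explodes
print(small, large)
```

After only 50 time steps the two cases differ by more than twenty orders of magnitude, which is why long sequences are so sensitive to the scale of the recurrent weights.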


In the case when the contribution is very small, the weight update may be a
negligible change, potentially causing the network to stop training. Practically,
this usually leads to underflow or overflow errors when not considered. One way to
alleviate this issue is to use the second-order derivatives to predict the presence
of vanishing/exploding gradients by using Hessian-free optimization techniques.
Another approach is to initialize the weights of the network carefully. However,
even with careful initialization, it can still be challenging to deal with
long-range dependencies.

⁴ The tanh activation function bounds the gradient between 0 and 1. This has the effect of shrinking
the gradient in these circumstances.

A common initialization for RNNs is to set the initial hidden state to $\mathbf{0}$. The
performance can typically be improved by allowing this hidden state to be learned
[KB14].

Adaptive learning rate methods, such as Adam [KB14], can be useful in recur- 
rent networks, as they cater to the dynamics of individual weights, which can vary 
significantly in RNNs. 

There are many methods used to combat the vanishing gradient problem, many
of them focusing on careful initialization or controlling the size of the gradient
being propagated. The most commonly used method to combat vanishing gradients
is the addition of gates to RNNs. We will focus more on this approach in the next
section. RNN sequences can be very long. For example, an RNN used for speech
recognition that samples 20 ms windows with a stride of 10 ms will produce an output
sequence of 999 time steps for a 10-s clip (assuming no padding). Thus, the
gradients can vanish/explode very easily [BSF94b].
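The figure of 999 time steps follows from the usual sliding-window count. The small sketch below reproduces the arithmetic; `num_frames` is a hypothetical helper, and it assumes no padding, as in the text.

```python
def num_frames(clip_ms, window_ms, stride_ms):
    """Number of full windows that fit in a clip when sliding by stride_ms."""
    if clip_ms < window_ms:
        return 0
    return (clip_ms - window_ms) // stride_ms + 1


# 10-second clip, 20 ms windows, 10 ms stride
steps = num_frames(10_000, 20, 10)
print(steps)
```

The first window covers 0 to 20 ms and the last covers 9980 to 10000 ms, giving (10000 - 20) / 10 + 1 windows.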


7.3.2.1 Long Short-Term Memory 





Fig. 7.5: Diagram of an LSTM cell 


Long short-term memory (LSTM) utilizes gates to control the gradient
propagation in the recurrent network's memory [HS97b]. These gates (referred to as the
input, output, and forget gates) are used to guard a memory cell that is carrying the
hidden state to the next time step. The gating mechanisms are themselves neural
network layers. This allows the network to learn the conditions for when to forget,
ignore, or keep information in the memory cell. Figure 7.5 shows a diagram of an
LSTM.

The LSTM cell is formally defined as:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1}) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
\tag{7.26}
$$


The forget gate controls how much is remembered from step to step. Some
recommend initializing the bias of the forget gate to 1 in order for it to remember
more initially [Haf17].
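A single LSTM step as written in Eq. (7.26) can be sketched in NumPy as follows. The parameter dictionary layout, the helper names, the sizes, and the random initialization are assumptions made for this example; the forget-gate bias is set to 1 to mirror the recommendation above.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM update following Eq. (7.26); p maps names to parameters."""
    i = sigmoid(p["W_i"] @ x + p["U_i"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_f"] @ x + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    o = sigmoid(p["W_o"] @ x + p["U_o"] @ h_prev + p["b_o"])   # output gate
    c_tilde = np.tanh(p["W_c"] @ x + p["U_c"] @ h_prev)        # candidate cell
    c = f * c_prev + i * c_tilde                               # new cell state
    h = o * np.tanh(c)                                         # new hidden state
    return h, c


def init_params(d_in, d_h, rng, forget_bias=1.0):
    """Small random weights; forget-gate bias initialized to forget_bias."""
    p = {}
    for gate in "ifoc":
        p["W_" + gate] = rng.normal(scale=0.1, size=(d_h, d_in))
        p["U_" + gate] = rng.normal(scale=0.1, size=(d_h, d_h))
    p["b_i"] = np.zeros(d_h)
    p["b_o"] = np.zeros(d_h)
    p["b_f"] = np.full(d_h, forget_bias)
    return p


rng = np.random.default_rng(0)
params = init_params(d_in=3, d_h=4, rng=rng)
h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):      # a hypothetical 5-step input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state is updated additively through `f * c_prev`, gradients flowing along it are scaled by the forget gate rather than repeatedly squashed, which is the mechanism that helps with vanishing gradients.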


7.3.2.2 Gated Recurrent Unit

Fig. 7.6: Diagram of a GRU


The gated recurrent unit (GRU) is another popular gating structure for RNNs 
[Cho+14]. The GRU combines the gates in the LSTM to create a simpler update 
rule with one less learned layer, lowering the complexity and increasing efficiency. 
The choice between using LSTM or GRU is largely decided empirically. Despite a 
number of attempts to compare the two methods, no generalizable conclusion has 
been reached [Chu+14]. The GRU uses fewer parameters, so it is usually chosen 
when performance is equal between the LSTM and GRU architectures. The GRU is 
shown in Fig. 7.6. The equations for the update rules are shown below: 


$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h h_{t-1} \circ r_t) \\
h_t &= (1 - z_t) \circ \tilde{h}_t + z_t \circ h_{t-1}
\end{aligned}
\tag{7.27}
$$

In the GRU, the new candidate state, $\tilde{h}_t$, is combined with the previous state,
with $z_t$ determining how much of the history is carried forward or how much the
new candidate replaces the history. Similar to setting the LSTM's forget gate bias
for improved memory in the early stages, the GRU's reset gate biases can be set to
−1 to achieve a similar effect [Haf17].
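The GRU update in Eq. (7.27) can be sketched in NumPy in the same style. Two assumptions are made here and should be read as such: the reset gate is applied to the previous state before it is multiplied by $U_h$ (implementations differ on this ordering), and no bias terms are used, matching the equations as printed. The parameter names and sizes are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x, h_prev, p):
    """One GRU update following Eq. (7.27); p maps names to weight matrices."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev)              # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev)              # reset gate
    # Assumption: the reset gate scales h_prev before the recurrent matrix.
    h_tilde = np.tanh(p["W_h"] @ x + p["U_h"] @ (h_prev * r))  # candidate state
    return (1.0 - z) * h_tilde + z * h_prev


rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = {}
for gate in "zrh":
    params["W_" + gate] = rng.normal(scale=0.1, size=(d_h, d_in))
    params["U_" + gate] = rng.normal(scale=0.1, size=(d_h, d_h))

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):   # a hypothetical 5-step input sequence
    h = gru_step(x, h, params)
print(h)
```

The last line of `gru_step` is a convex combination of the candidate and the previous state, so a value of $z_t$ near 1 carries the history forward almost unchanged.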


7.3.2.3 Gradient Clipping 


A simple way to limit gradient explosion is to force the gradients to a specific range. 
Limiting the gradient’s range can solve a number of problems, specifically prevent- 
ing overflow errors when training. It is typically good practice to track the gradient 
norm to understand its characteristics, and then reduce the gradient when it exceeds 
the normal operating range. This concept is commonly referred to as gradient clip- 
ping. 

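The two most common ways to clip gradients are:


• $L_2$ norm clipping with a threshold $t$:

$$\nabla_{new} = \nabla_{current} \circ \frac{t}{L_2(\nabla)} \tag{7.28}$$

• Fixed range:

$$
\nabla_{new} =
\begin{cases}
t_{min} & \text{if } \nabla < t_{min} \\
\nabla & \text{otherwise} \\
t_{max} & \text{if } \nabla > t_{max}
\end{cases}
\tag{7.29}
$$

with a maximum threshold $t_{max}$ and minimum threshold $t_{min}$.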
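Both clipping rules can be sketched in a few lines of NumPy. The function names are hypothetical, and the norm-based version rescales only when the norm exceeds the threshold, which is the usual reading of Eq. (7.28).

```python
import numpy as np


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold (Eq. 7.28)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


def clip_by_value(grad, t_min, t_max):
    """Clamp every gradient entry to the range [t_min, t_max] (Eq. 7.29)."""
    return np.clip(grad, t_min, t_max)


g = np.array([3.0, 4.0])                       # L2 norm is 5
clipped_norm = clip_by_norm(g, 1.0)
clipped_value = clip_by_value(np.array([-2.0, 0.3, 5.0]), -1.0, 1.0)
print(clipped_norm, clipped_value)
```

Norm clipping preserves the direction of the gradient and changes only its length, whereas fixed-range clipping can change the direction because each entry is clamped independently.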


7.3.2.4 BPTT Sequence Length 


The computation involved in recurrent network training depends heavily on the 
number of time steps in the input. One way to fix/limit the amount of computa- 
tion in the training process is to set a maximum sequence length for the training 
procedure. 

Common ways to set the sequence length are:


• Pad training data to the longest desired length.
• Truncate the number of steps backpropagated during training.


In the early stages of training, overlapping sequences with truncated
backpropagation can help the network converge quicker. Increasing the truncation
length as training progresses can also help convergence in the early stages of
learning, particularly for complex sequences, or when the maximum sequence length
in a dataset is quite long.

Setting a maximum sequence length can be useful in a variety of situations. In
particular, when:


• a static computational graph requires a fixed size input,
• the model is memory constrained, or
• gradients are very large at the beginning of training.
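Truncation with optional overlap, as discussed in this section, amounts to splitting a long sequence into windows of a fixed maximum length. The sketch below shows only that splitting step; `bptt_windows` is a hypothetical helper, and in an actual training loop the hidden state would typically be carried from one window to the next while gradients are cut at the window boundary.

```python
def bptt_windows(sequence, length, overlap=0):
    """Split a sequence into windows of at most `length` items.

    Consecutive windows share `overlap` items; the last window may be shorter.
    """
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    step = length - overlap
    windows = []
    for start in range(0, len(sequence), step):
        windows.append(sequence[start:start + length])
        if start + length >= len(sequence):
            break            # the sequence is fully covered
    return windows


tokens = list(range(10))
plain = bptt_windows(tokens, length=4)
overlapping = bptt_windows(tokens, length=4, overlap=2)
print(plain)
print(overlapping)
```

Growing `length` over the course of training is one way to implement the schedule described above, starting with short windows and moving to longer ones.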


7.3.2.5 Recurrent Dropout 


Recurrent networks, like other deep learning networks, are prone to overfitting.
Dropout, being a common regularization technique, is an intuitive choice to apply
to RNNs as well; however, the original form must be modified. If the original form
of dropout is applied at each step, with a new mask sampled every time, then the
combination of masks can cause little signal to be passed over longer sequences.
Instead, we can reuse the same mask at each step [SSB16] to prevent loss of
information between time steps.

Additional techniques such as variational dropout [GG16] and zoneout [Kru+16] 
have similar aims, by dropping out input or output gates in LSTMs or GRUs. 
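The idea of reusing one mask across time steps can be sketched for a vanilla RNN as follows. This is an illustrative assumption about where the mask is applied (on the recurrent hidden state, with inverted-dropout scaling); the cited papers differ in these details, and the function name is hypothetical.

```python
import numpy as np

def rnn_with_shared_dropout(xs, W_x, W_h, keep_prob, rng):
    """Run a tanh RNN over xs, applying the same dropout mask to h at every step."""
    hidden = W_h.shape[0]
    # Sample the mask once per sequence; dividing by keep_prob keeps the expected scale.
    mask = rng.binomial(1, keep_prob, size=hidden) / keep_prob
    h = np.zeros(hidden)
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ (h * mask))  # identical mask at each time step
        states.append(h)
    return np.stack(states), mask

rng = np.random.default_rng(0)
xs = rng.normal(size=(6, 3))                      # 6 time steps, 3 input features
W_x, W_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
states, mask = rnn_with_shared_dropout(xs, W_x, W_h, keep_prob=0.5, rng=rng)
```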


7.4 Deep RNN Architectures 


As with the entire field of deep learning, many of the architectures and techniques 
are an area of active research. In this section, we describe a few architectural variants 
to illustrate the expressive power and extensions of the basic RNN concepts that 
have been introduced so far. 


7.4.1 Deep RNNs 


Just as we have stacked multiple fully connected and convolutional layers, we can
also stack layers of recurrent networks [EHB96]. The hidden state in a stacked RNN
composed of $l$ vanilla RNN layers can be defined as follows:

\[
h_t^{(l)} = f\left(W \left[h_t^{(l-1)};\, h_{t-1}^{(l)}\right]\right) \tag{7.30}
\]

where $h_t^{(l-1)}$ is the output of the previous RNN layer at time $t$. This is
illustrated in Fig. 7.7. Anecdotally, when convolutional layers were stacked, the
network was learning a hierarchy of spatially correlated features. Similarly,
stacking recurrent networks allows longer ranges of dependencies and more complex
sequences to be learned [Pas+13].
Because the number of recurrent weights in an RNN layer grows quadratically with
the hidden size, it can also be more efficient to have multiple smaller layers
rather than larger ones. Another benefit is computational optimization for fused
RNN layers [AKB16].

A common problem with stacking RNNs is the vanishing gradient problem, caused by
both the depth and the number of time steps. RNNs have, however, been able to gain
inspiration from other areas of deep learning, incorporating residual connections
and highway networks seen in deep convolutional networks.

Fig. 7.7: Diagram of a stacked RNN with $l = 2$
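A stacked RNN in the sense of Eq. (7.30) can be sketched by feeding the hidden states of one layer as the input sequence of the next. In this sketch the single matrix $W$ is split into an input block and a recurrent block per layer, which is equivalent to multiplying the concatenated vector; tanh is assumed for the activation and biases are omitted.

```python
import numpy as np

def stacked_rnn(xs, layers):
    """layers is a list of (W_in, W_rec) pairs, ordered from the lowest layer upward."""
    inputs = xs
    for W_in, W_rec in layers:
        h = np.zeros(W_rec.shape[0])
        outputs = []
        for x in inputs:
            # W_in acts on the layer below at time t, W_rec on this layer at time t-1.
            h = np.tanh(W_in @ x + W_rec @ h)
            outputs.append(h)
        inputs = np.stack(outputs)   # hidden states become the next layer's inputs
    return inputs

rng = np.random.default_rng(0)
xs = rng.normal(size=(7, 3))
layers = [(rng.normal(size=(4, 3)), rng.normal(size=(4, 4))),
          (rng.normal(size=(5, 4)), rng.normal(size=(5, 5)))]
top_states = stacked_rnn(xs, layers)   # one 5-dimensional state per time step
```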


7.4.2 Residual LSTM 


In Prakash et al. [Pra+16], the authors used residual connections between layers of
the LSTM to provide a stronger gradient to lower layers for the purpose of
paraphrase generation. Residual layers, typically applied in convolutional networks,
allow “residuals” of lower level information to pass on to later layers of the
network. This provides lower level information to higher layers and also allows a
larger gradient to be passed to the earlier layers, because there is a more direct
connection to the output. In Kim et al. [KEL17], the authors used residual
connections to improve word error rates on a deep speech network, highlighting the
lack of accumulation on the highway path and the use of a projection matrix to
scale the LSTM output.
In the LSTM definition in Eq. (7.26), $h_t$ is changed to:

\[
h_t = o_t \circ \left(W_p \cdot \tanh(c_t) + W_h x_t\right) \tag{7.31}
\]

where $W_p$ is the projection matrix and $W_h$ is an identity matrix that matches the
sizes of $x_t$ to $h_t$. When the dimensions of $x_t$ and $h_t$ are the same, this
equation becomes:

\[
h_t = o_t \circ \left(W_p \cdot \tanh(c_t) + x_t\right). \tag{7.32}
\]

Note that the output gate is applied after the addition of the input $x_t$.
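The modified output computation in Eqs. (7.31) and (7.32) amounts to a one-line change to the LSTM cell. The sketch below covers only that output step, assuming the gates and cell state have already been computed by a standard LSTM; the function name is hypothetical.

```python
import numpy as np

def residual_lstm_output(o_t, c_t, x_t, W_p, W_h=None):
    """Residual LSTM output: the gate o_t is applied after adding the shortcut from x_t."""
    # With W_h=None the input and hidden sizes are assumed equal (Eq. 7.32).
    shortcut = x_t if W_h is None else W_h @ x_t
    return o_t * (W_p @ np.tanh(c_t) + shortcut)

o_t = np.ones(2)                     # fully open output gate
c_t = np.zeros(2)                    # empty memory cell, so tanh(c_t) = 0
x_t = np.array([1.0, 2.0])
h_t = residual_lstm_output(o_t, c_t, x_t, W_p=np.eye(2))   # input passes straight through
```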
7.4.3 Recurrent Highway Networks


Recurrent highway networks (RHN) [Zil+16] offer an approach to gate the gradient
propagation between recurrent layers in multilayer RNN architectures. The authors
present an extension of the LSTM architecture that allows for gated connections
between the recurrent layers, allowing an increase in the number of layers that can
be stacked for deep RNNs.

Fig. 7.8: A two layer residual LSTM

For an RHN with $L$ layers and output $s^{(L)}$, the network is described as:

\[
\begin{aligned}
s_t^{(l)} &= h_t^{(l)} \circ t_t^{(l)} + s_t^{(l-1)} \circ c_t^{(l)} \\
h_t^{(l)} &= \tanh\left(W_H x_t \mathbb{1}_{\{l=1\}} + R_{H_l} s_t^{(l-1)} + b_{H_l}\right) \\
t_t^{(l)} &= \sigma\left(W_T x_t \mathbb{1}_{\{l=1\}} + R_{T_l} s_t^{(l-1)} + b_{T_l}\right) \\
c_t^{(l)} &= \sigma\left(W_C x_t \mathbb{1}_{\{l=1\}} + R_{C_l} s_t^{(l-1)} + b_{C_l}\right)
\end{aligned} \tag{7.33}
\]

with $\mathbb{1}$ denoting the indicator function.

A number of useful properties are gained from RHNs, specifically that the Jaco- 
bian eigenvalue is regulated across time steps, facilitating more stable training. The 
authors reported impressive results on a language modeling task using a 10 layer 
deep RHN. 


7.4.4 Bidirectional RNNs 


So far we have only considered the accumulation of a memory context in the forward
direction. In many situations it is desirable to know what will be encountered in
future time steps to inform the prediction at time $t$. Bidirectional RNNs [SP97]
allow for both the forward context and “backward” context to be incorporated into
a prediction. This is accomplished by running two RNNs over a sequence, one in
the forward direction and one in the backward direction. For an input sequence
$X = \{x_1, x_2, \ldots, x_T\}$, the forward context RNN receives the inputs in
forward order $f = \{1, 2, \ldots, T\}$ and the backward context RNN receives the
inputs in reverse order $b = \{T, T-1, \ldots, 1\}$. These two RNNs together
constitute a single bidirectional layer. Figure 7.10 shows a diagram of a
bidirectional RNN.

Fig. 7.9: A diagram of a two layer highway LSTM. Note that the highway connection
uses a learned gate along the connection to the next layer

Fig. 7.10: Diagram of a bidirectional RNN. Here the outputs are concatenated to
form a single output vector holding the forward and backward context





The outputs of the two RNNs, $h^f$ and $h^b$, are often joined to form a single
output vector by summing the two vectors, concatenating, averaging, or another
method.

In NLP, there are many uses for this type of structure. For example, this has 
proven very useful for the task of phoneme classification in speech recognition, 
where knowledge of the future context can better inform the predictions at any for- 
ward time step. Bidirectional networks typically outperform forward-only RNNs in 
most tasks. Furthermore, this approach can be extended to other forms of recurrent 
networks such as bidirectional LSTMs (BiLSTM). These techniques follow logi- 
cally, with one LSTM network operating over the inputs in the forward direction 
and another with the inputs in the reverse direction, combining the outputs (con- 
catenation, addition, or another method). 

One limitation of bidirectional RNNs is that the full input sequence must be known
before prediction, because the reverse RNN requires $x_T$ for the first computation.
Thus bidirectional RNNs cannot be used directly for real-time applications. However,
depending on the requirements of the application, having a fixed buffer for the
input can alleviate this restriction.
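A bidirectional layer can be sketched by running one vanilla RNN over the sequence and a second one over the reversed sequence, then re-aligning and joining the states. Concatenation is used here as one of the joining options mentioned above; the helper names, the tanh activation, and the omission of biases are assumptions.

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Vanilla tanh RNN; returns one hidden state per time step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(xs, fwd, bwd):
    """fwd and bwd are (W_x, W_h) pairs for the forward and backward RNNs."""
    h_f = run_rnn(xs, *fwd)
    h_b = run_rnn(xs[::-1], *bwd)[::-1]       # reverse again so index t matches x_t
    return np.concatenate([h_f, h_b], axis=1)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))
fwd = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
bwd = (rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
out = bidirectional_rnn(xs, fwd, bwd)         # shape (5, 8)
```

The backward half of the output at the first time step already depends on the entire sequence, which is why the whole input must be available before prediction.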
7.4.5 SRU and Quasi-RNN


The recurrent connections restrict the amount of computation that can be paral- 
lelized, because the information must be processed sequentially. Thus, the com- 
putational cost of RNNs is high compared to CNNs. Two techniques introduced 
to speed up computation involve eliminating some of the sequential dependencies. 
These techniques allow networks to become much deeper for a lower computational
cost. The first technique introduces the simple recurrent unit (SRU) [LZA17].
The approach processes the input at each time step simultaneously and applies a
light-weight recurrent computation afterward. The SRU incorporates skip and highway
connections to improve the feature propagation in the network. The SRU is
defined as:

\[
\begin{aligned}
\tilde{x}_t &= W x_t \\
f_t &= \sigma(W_f x_t + b_f) \\
r_t &= \sigma(W_r x_t + b_r) \\
c_t &= f_t \circ c_{t-1} + (1 - f_t) \circ \tilde{x}_t \\
h_t &= r_t \circ g(c_t) + (1 - r_t) \circ x_t
\end{aligned} \tag{7.34}
\]

where $f$ is the forget gate, $r$ is the reset gate, and $c$ is the memory cell.

This approach was applied to text classification, question answering, language 
modeling, machine translation, and speech recognition, achieving competitive re- 
sults with a reduction in training times by up to 10x over the LSTM counterpart. 
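The source of the speedup is visible in a direct sketch of Eq. (7.34): every matrix product depends only on the inputs and can be computed for all time steps at once, leaving only element-wise operations in the sequential loop. The sketch assumes equal input and hidden sizes (needed for the highway term) and tanh for $g$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru(xs, W, W_f, b_f, W_r, b_r, g=np.tanh):
    """xs has shape (T, d); all weight matrices are (d, d)."""
    # Time-independent part: one batched matrix product per gate.
    x_tilde = xs @ W.T
    f = sigmoid(xs @ W_f.T + b_f)
    r = sigmoid(xs @ W_r.T + b_r)
    # Sequential part: element-wise operations only.
    c = np.zeros(W.shape[0])
    hs = []
    for t in range(len(xs)):
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]
        hs.append(r[t] * g(c) + (1.0 - r[t]) * xs[t])
    return np.stack(hs)

rng = np.random.default_rng(0)
d = 4
xs = rng.normal(size=(6, d))
W, W_f, W_r = (rng.normal(size=(d, d)) for _ in range(3))
hs = sru(xs, W, W_f, np.zeros(d), W_r, np.zeros(d))
```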

The quasi-recurrent neural network (QRNN) [Bra+16] is a different approach 
with the same goal. The QRNN applies convolutional layers to parallelize the input 
computation that is being supplied to the reduced recurrent component. This net- 
work was applied to the task of sentiment analysis and also achieved competitive 
results with a significant reduction in training and prediction time. 


7.4.6 Recursive Neural Networks 


Recursive neural networks (RecNN) are a generalized form of recurrent neural net- 
works that allow effective manipulation of graphical structures. A recursive neural 
network can learn information related to labeled directed acyclic graphs, while re- 
current networks only process ordered sequences [GK96]. In NLP, the main appli- 
cation for recursive neural networks is dependency parsing [SMN10] and learning 
morphological word vectors [LSM13b].

The data is structured as a tree with parent nodes at the top and children nodes
stemming from them. The aim is to learn the appropriate graphical structure of
the data by predicting a tree and reducing the error with respect to the target tree
structure.

Fig. 7.11: Diagram of a recursive neural network


For simplicity, we consider a branching factor of 2 (2 children for each parent).
For structure prediction, a recursive neural network aims to produce two outputs:


• A semantic vector representation, $p(c_i, c_j)$, merging the children nodes $c_i$ and $c_j$
• A score $s$ indicating how likely the children nodes are to be merged.


The network can be described as follows:

\[
\begin{aligned}
s_{ij} &= U p(c_i, c_j) \\
p(c_i, c_j) &= f\left(W [c_i; c_j] + b\right)
\end{aligned} \tag{7.35}
\]

where $W$ is the weight matrix for the shared layer and $U$ is the weight matrix for
the score computation.
The score of a tree is the sum of the scores at each node:

\[
s = \sum_{n \in \text{nodes}} s_n \tag{7.36}
\]

The error computation for recursive neural networks uses max-margin parsing:

\[
E = \sum_i s(x_i, y_i) - \max_{y \in A(x_i)} \left(s(x_i, y) + \Delta(y, y_i)\right) \tag{7.37}
\]

The quantity $\Delta(y, y_i)$ computes the loss for all incorrect decisions.
Backpropagation through structure (BPTS), similar to BPTT, computes the derivatives
at each node in the graph. The derivatives are split at each node and passed on
to the children nodes. In addition to the gradient with respect to the predicted
node, we also compute the gradient with respect to the score values.
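The forward computation of Eqs. (7.35) and (7.36) can be sketched as below. The composition function follows the equations directly; the greedy strategy of repeatedly merging the best-scoring adjacent pair is an assumption chosen to keep the example short, as are the tanh activation and the use of a score vector for $U$.

```python
import numpy as np

def compose(c_i, c_j, W, b, U):
    """Merge two child vectors into a parent vector p and a merge score s (Eq. 7.35)."""
    p = np.tanh(W @ np.concatenate([c_i, c_j]) + b)
    s = float(U @ p)
    return p, s

def greedy_tree_score(leaves, W, b, U):
    """Greedily merge adjacent nodes; the tree score is the sum of node scores (Eq. 7.36)."""
    nodes = list(leaves)
    total = 0.0
    while len(nodes) > 1:
        candidates = [compose(nodes[k], nodes[k + 1], W, b, U)
                      for k in range(len(nodes) - 1)]
        best = max(range(len(candidates)), key=lambda k: candidates[k][1])
        p, s = candidates[best]
        nodes[best:best + 2] = [p]        # replace the two children with their parent
        total += s
    return nodes[0], total

rng = np.random.default_rng(0)
d = 3
W, b, U = rng.normal(size=(d, 2 * d)), np.zeros(d), rng.normal(size=d)
leaves = [rng.normal(size=d) for _ in range(4)]
root, tree_score = greedy_tree_score(leaves, W, b, U)
```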

LSTM and GRU units have also been applied to recursive networks to combat the 
vanishing gradient problem [TSM15]. Recursive networks have been used in areas
such as relation classification [Soc+12], sentiment analysis [Soc+13], and phrase 
similarity [TSM15]. 

Recursive neural networks demonstrate powerful extensions to sequence-based 
neural architectures. Although their use has decreased in popularity with the in- 
troduction of attention-based architectures, the concepts they present for improved 
computational efficiency are useful. 


7.5 Extensions of Recurrent Networks 


Recurrent neural networks can be used to accomplish many types of sequence prob- 
lems. Until now, we have focused on a many-to-many example with a one-to-one 
mapping from an input to an output with the same number of time steps. However, 
RNNs can be used for many types of sequence oriented problems by modifying 
where the error is computed. Figure 7.12 shows the types of sequence problems that 
can be solved with RNNs. 
Fig. 7.12: Recurrent neural networks can address a variety of sequence-based
problems. (a) shows a one-to-one sequence (this would be equivalent to deep neural
networks with shared weights). (b) illustrates a one-to-many sequence task,
generating a series of outputs given one input. (c) is a many-to-one task, which
could represent a text classification task, predicting a single classification after
the entire text has been seen. (d) shows a many-to-many sequence task with a
one-to-one alignment between the number of input and output time steps. This
structure is common in language modeling. (e) shows a many-to-many task without a
specific alignment between the input and output. The numbers of input and output
steps can also differ. This technique is commonly referred to as
sequence-to-sequence and is commonly seen in neural machine translation
RNNs have tremendous flexibility and can be extended to address a wide range of
sequence tasks. The limitations of feed-forward neural networks remain: a tendency 
to overfit without proper regularization, need for large datasets, and computational 
requirements. Additionally, sequence models introduce other considerations, such 
as vanishing gradients with longer sequence lengths and “forgetting” of earlier con- 
text. These difficulties have led to various extensions, best practices, and techniques 
to alleviate these issues. 


7.5.1 Sequence-to-Sequence 


Many NLP and speech tasks are sequence oriented. One of the most common
architectural approaches for these tasks is sequence-to-sequence, often abbreviated
seq-to-seq or seq2seq. The seq-to-seq approach resembles an autoencoder, having a
recurrent encoder and a recurrent decoder, shown in Fig. 7.13. The final hidden state
of the encoder functions as the “encoding” passed to the decoder; unlike an
autoencoder, however, the model is typically trained in a supervised fashion with a
specific output sequence. The seq-to-seq approach was born out of neural machine
translation, having an input sentence in one language and a corresponding output
sentence in a separate language. The aim is to summarize the input with the encoder
and decode to the new domain with the decoder.

Fig. 7.13: Seq-to-seq model with an RNN-based encoder and decoder. The first
hidden state of the decoder is the last hidden state of the encoder (shown in yellow)





One difficulty of this approach is that the hidden state tends to reflect the most 
recent information, losing memory of earlier content. This forces a limitation for 
long sequences, where all information about that sequence must be summarized 
into a single encoding. 
7.5.2 Attention


Forcing a single vector to summarize all the information from the previous time 
steps is a drawback. In most applications, the information from a generated sequence 
from the decoder will have some correlation with the input sequence. For example, 
in machine translation, the beginning of the output sentence likely depends on the 
beginning of the input sentence and less so on the end of the sentence, which has 
been seen more recently. In many situations, it would be helpful to have not only the 
summarized knowledge, but also the ability to focus on different parts of the input 
to better inform the output at a particular time step. 

Attention [BCB14a] has been one of the most popular techniques to address this
issue by paying specific attention to parts of the sequence for each word in the output 
sequence. Not only does this allow us to improve the quality of our predictions, it 
also allows insight into the network by viewing what inputs were relied upon for the 
prediction. 

If $s_i$ is the attention-augmented hidden state at time $i$, it takes three inputs:


• the previous hidden state of the decoder $s_{i-1}$,
• the prediction from the previous time step $y_{i-1}$, and
• a context vector $c_i$ which weighs the appropriate hidden states for the given
  time step.

\[
s_i = f(s_{i-1}, y_{i-1}, c_i) \tag{7.38}
\]

The context vector, $c_i$, is defined as:

\[
c_i = \sum_{j=1}^{T} \alpha_{ij} h_j \tag{7.39}
\]

where the attention weights are:

\[
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})} \tag{7.40}
\]

and

\[
e_{ij} = a(s_{i-1}, h_j). \tag{7.41}
\]

The function $a(s, h)$ is referred to as the alignment model. This function scores how
influential input $h_j$ should be on the output at position $i$.

This attention mechanism is fully differentiable and deterministic because it
considers all time steps that have contributed to the output. A drawback of using
all of the previous time steps is that it requires a large amount of computation
for long sequences. Other techniques relax this dependency by being selective about
the number of states that inform the context vector. Doing this creates a
non-differentiable loss, however, and training requires Monte Carlo sampling for
the estimation of the gradient during backpropagation.
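The computation of the context vector in Eqs. (7.39) to (7.41) can be sketched as follows. The text leaves the alignment model $a(s, h)$ unspecified, so a plain dot product is used here as an assumption (a small feed-forward network is another common choice); the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(s_prev, H, align=None):
    """s_prev: previous decoder state; H: encoder hidden states with shape (T, d)."""
    if align is None:
        align = lambda s, h: s @ h                      # assumed dot-product alignment
    e = np.array([align(s_prev, h_j) for h_j in H])     # alignment scores e_ij
    alpha = softmax(e)                                  # attention weights
    c = alpha @ H                                       # weighted sum of encoder states
    return c, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))
c, alpha = attention_context(rng.normal(size=4), H)
```

The weights `alpha` are the per-time-step scores that can be inspected to see which inputs the decoder relied on.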
An additional benefit of attention is that it provides a score for each time step,
identifying what inputs were most useful for the prediction. This can be very useful 
for interpretability when inspecting the quality of a network or gaining intuition 
about what the model is learning, as shown in Fig. 7.15. Attention mechanisms are 
covered in more detail in Chap. 9. 
Fig. 7.14: Attention is applied to the first step of decoding for a neural machine
translation model. A similarity score is computed for the hidden state at each time 
step in the encoder and the current hidden state of the decoder. These scores are 
used to weigh the contribution of that time step. These weights are used to produce 
the context vector that is supplied to the decoder 


7.5.3 Pointer Networks 


Pointer networks [VFJ15] are an application of attention-based,
sequence-to-sequence models. Contrary to other attention-based models, they select
words (points) to be used as the output instead of accumulating the input sequence
into a context vector. The output dictionary in this scenario must grow with the
length of the input sequence. To accommodate this, an attention mechanism is used
as a pointer, rather than mixing the information for decoding.
\[
\begin{aligned}
u_j^i &= v^\top \tanh(W_1 e_j + W_2 d_i) \\
P(C_i \mid C_1, \ldots, C_{i-1}, \mathcal{P}) &= \operatorname{softmax}(u^i)
\end{aligned} \tag{7.42}
\]

where $e_j$ is the output of the encoder at time $j \in \{1, \ldots, n\}$, $d_i$ is the
decoder output at time step $i$, $C_i$ is the index at time $i$, and $v$, $W_1$, and
$W_2$ are learnable parameters.

Fig. 7.15: Attention weights on an English-to-French machine translation task.
Notice how the attended area of the network is correlated with the output sequence


This model showed success finding planar convex hulls, computing Delaunay 
triangulations, and producing solutions to the traveling salesman problem. 
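The pointer distribution of Eq. (7.42) can be sketched directly. The result is a probability over input positions, so its size follows the input length rather than a fixed vocabulary. The function names and the dimensions in the example are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pointer_distribution(E, d_i, W1, W2, v):
    """E: encoder outputs (n, d_e); d_i: decoder output (d_d,). Returns P over n positions."""
    u = np.tanh(E @ W1.T + W2 @ d_i) @ v     # one score u_j per input position
    return softmax(u)

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))                  # six input elements
d_i = rng.normal(size=3)
W1, W2, v = rng.normal(size=(5, 4)), rng.normal(size=(5, 3)), rng.normal(size=5)
p = pointer_distribution(E, d_i, W1, W2, v)
pointed_index = int(np.argmax(p))            # input position selected at this step
```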


7.5.4 Transformer Networks 


The success of attention on seq-to-seq tasks prompts the question of whether it can 
be directly applied to the input, reducing or even eliminating the need for recurrent 
connections in the network. Transformer networks [Vas+17b] applied this attention 
directly to the input with great success, beating both recurrent and convolutional 
models in machine translation. Instead of relying on RNNs to accumulate a memory 
of previous states as in sequence-to-sequence models, the transformer uses “multi- 
headed” attention directly on the input embeddings. This alleviates the sequential 
dependencies of the network allowing much of the computation to be performed in 
parallel. 

Attention is applied directly to the input sequence, as well as the output sequence
as it is being predicted. The encoder and decoder portions are combined, using
another attention mechanism before predicting a probability distribution over the
output dictionary.

Scaled dot-product attention, shown in Fig. 7.16, is defined by three input
matrices: $Q$, the set of queries packed into a matrix; keys $K$; and values $V$.

\[
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \tag{7.43}
\]

Multi-head attention is then defined as:

\[
\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \tag{7.44}
\]

where

\[
\text{head}_i(Q, K, V) = \operatorname{Attention}\left(QW_i^Q, KW_i^K, VW_i^V\right). \tag{7.45}
\]

Here $d_k$ is the dimension of the keys, and all $W$ matrices are learned projection
matrices.
Fig. 7.16: Illustration of scaled dot-product attention (referred to as attention in
the text) and multi-head attention. (a) Scaled dot-product attention, (b) multi-head
attention


The encoder and decoder apply multiple layers of multi-head attention with residual
connections and additional fully connected layers. Because much of the computation
is happening in parallel, a masking technique and offsetting are used to ensure
that the network only uses information that is available up to time $t - 1$ when
predicting for time $t$. The transformer network reduces the number of sequential
steps required for prediction, which significantly improves the computation time,
while achieving state-of-the-art results on the translation task.
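Eqs. (7.43) to (7.45) and the masking idea can be sketched in NumPy. This is a simplified single-sequence illustration without the residual connections, layer stacking, or fully connected layers of the full model. The causal mask lets position $i$ attend only to positions up to $i$; combined with shifting the decoder input by one step, this corresponds to the restriction described above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention (Eq. 7.43) for 2-D matrices."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Block attention to later positions by setting their scores to -inf.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def multi_head(Q, K, V, heads, W_o, causal=False):
    """heads is a list of (W_q, W_k, W_v) projection triples (Eqs. 7.44 and 7.45)."""
    outs = [attention(Q @ W_q, K @ W_k, V @ W_v, causal) for W_q, W_k, W_v in heads]
    return np.concatenate(outs, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 positions, model size 8
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(8, 8))
out = multi_head(X, X, X, heads, W_o, causal=True)   # self-attention over X
```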
7.6 Applications of RNNs in NLP


Incorporating text into recurrent networks is a straightforward process, resembling
the CNN classification in the previous chapter. The words of a sentence are con- 
verted into word embeddings and passed as a time series into our network. In this 
case we do not have to worry about a minimum length to our sequence, because the 
word context is learned in the RNN’s memory rather than as a combination of the 
inputs. 

In Yin et al. [Yin+17], the authors do a wide comparison of CNN and RNN ar- 
chitectures for a variety of NLP tasks such as text classification, entailment, answer 
selection, and POS tagging. In this work the authors train basic CNN and RNN ar- 
chitectures, showing that RNNs perform well on most tasks, with CNNs proving 
superior only in certain matching cases where the main features are essentially key 
phrases. Overall, CNNs and RNNs have different methods of modeling sentences. 
CNNs tend to learn features similar to n-grams, while RNNs aim to maintain long- 
range dependencies for defining context. 


7.6.1 Text Classification 


Figure 7.17 shows the structure of a simple text classification task for an input sen- 
tence. With a recurrent network we are able to sequentially encode the word em- 
beddings at each time step. Once the entire sequence is encoded, we use the last 
hidden state to predict the class. The network is trained using BPTT and learns to 
sequentially weigh the words for the classification task. 


Fig. 7.17: Simple RNN-based text classifier for sentiment classification


In Lee and Dernoncourt [LD16], the authors compared CNN and RNN architectures
for short-text classification. The addition of sequential information via CNN
and RNN architectures significantly improved the results on dialog act
characterization.
In sentiment classification, Wang et al. [Wan+15b] encoded tweets using an
LSTM network to predict sentiment. In their work they showed the robustness of
RNNs to capture complexities contained within the structure of the tweets,
particularly the effect of negation phrases, such as when the word not negated a
phrase. In Lowe et al. [Low+15], the authors introduced an architecture called
dual-LSTM for semantic matching. This architecture encodes questions and answers
and uses the inner product of the question and answer vector to rank the candidate
responses.


7.6.2 Part-of-Speech Tagging and Named Entity Recognition 


In Huang et al. [HXY15], word features and embeddings were applied to POS, NER,
and chunking tasks with a bidirectional LSTM with CRF to boost performance. Ma and
Hovy [MH16] used a bidirectional LSTM for end-to-end POS classification on WSJ,
improving on these results. Their method does not rely on context features that
were applied in other works, such as POS, lexicon features, and task-dependent
preprocessing. In Lample et al. [Lam+16b], a bidirectional LSTM was used in
conjunction with CRF to achieve state-of-the-art performance on NER in four
languages on the CoNLL-2003 dataset. This work also extended the base RNN-CRF
architecture to stack-LSTM (LSTM units used to mimic a stack data structure with
pushing and popping capabilities). Character embeddings are often incorporated in
addition to word embeddings to capture additional information about a word’s
semantic structure as well as to inform predictions on OOV words.


7.6.3 Dependency Parsing 


In Dyer et al. [Dye+15], stack-LSTMs that allow for pushing and popping operations
were used for dependency parsing of variable-length text by predicting the
dependency tree transitions. Kiperwasser and Goldberg [KG16] simplified the
architecture by removing the need for the stack-LSTM, relying on bidirectional
LSTMs to predict the dependency tree transitions.


7.6.4 Topic Modeling and Summarization 


In Ghosh et al. [Gho+16], the contextual LSTM (C-LSTM) was introduced for word
prediction, sentence selection, and topic prediction. The C-LSTM concatenates a 
topic embedding with the word embedding at each time step in the training of the 
network. This work functions similarly to language model training in that the goal 
is to predict the next word; however, it is extended to include the topic context into 
the prediction as well. Thus, the aim is to predict the next word and the topic of the 
sentence so far. 
7.6.5 Question Answering


In Tan et al. [Tan+15], the authors train a question RNN and an answer RNN to yield
a respective embedding for each. The two networks are then trained simultaneously 
by using a hinge loss objective to enforce a cosine similarity between the two most 
probable pairs. Another approach, dynamic memory networks [XMS16], incorpo- 
rates a series of components to make up a question answering system. This system 
used a combination of recurrent networks and attention mechanisms to construct 
input, question, and answer modules that utilize an episodic memory to condition 
the predictions. 


7.6.6 Multi-Modal 


The effectiveness of deep learning in other applications such as images and video 
has led to a variety of multi-modal applications. These applications require gener- 
ating language based on an input medium. These applications include image and 
video captioning, visual question answering, and visual speech recognition. 

Image captioning was one of the first ways that deep convolutional networks for 
images were combined with text. In Vinyals et al. [Vin+15b], the authors utilized 
a pre-trained convolutional network for image classification to generate an image 
embedding for the initial state of an LSTM network. The LSTM network was trained 
to predict each word of the caption. The initial approach led to advancements in 
RNN architectures [Wan+16a].

Video captioning showed a similar development, with [Ven+14] utilizing a
pre-trained CNN model to extract image features for each video frame to be used as
input into a recurrent network for text generation. Pan et al. [Pan+15a] extended 
this method by striding over the output frames of the earlier recurrent layers to 
create a “hierarchical recurrent neural encoder” to reduce the number of time steps 
considered in the output layer of the stacked RNN. 

In visual question answering, language generation is used to generate an answer 
for a textual question related to a visual input. An end-to-end approach was shown 
with the neural-image-QA network in Malinowski et al. [MRF15], where the input 
image and question conditioned the LSTM network to generate a textual answer. 


7.6.7 Language Models 


In the previous chapters we have briefly discussed language models. Recall that a
language model provides a way to determine the probability of a sequence of words.
For example, an n-gram language model determines the probability of a sequence of
words $P(w_1, \ldots, w_m)$ by looking at the probability of each word given its
$n - 1$ preceding words:

\[
P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}). \tag{7.46}
\]
Language models are particularly interesting in NLP as they can provide additional
contextual information to situations where a prediction might be semantically
similar, yet syntactically different. In the case of speech recognition, two words that 
sound the same such as “to” and “two” have different meanings. But the phrase “set 
a timer for to minutes” doesn’t make sense, whereas “set a timer for two minutes” 
does. 

Language models are often used when language is being generated and for do- 
main adaptation (where there may be large amounts of unlabeled text data and lim- 
ited labeled data). The concept of n-gram language models can also be implemented 
with RNNs, which benefit from not having to set a hard cutoff for the number of 
grams considered. Additionally, similar to word vectors, these models can be trained 
in an unsupervised manner over a large corpus of data. 

The language model is trained to predict the next word in the sequence given
the previous context, i.e., the hidden state. This allows the language model to
target learning:

\[
P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}). \tag{7.47}
\]


An example of an RNN-based language model is shown in Fig. 7.18. 
Fig. 7.18: An RNN language model trained to predict the next word in the sequence
given the entire history of the sequence. Note that each time step is focused on 
classification, therefore the target outputs are the size of the vocabulary, not the size 
of the input word embeddings 


In language modeling, a good practice is to have a single embedding matrix for
both the input and the output sequence, allowing parameters to be shared, reducing
the total number of parameters that need to be learned. Additionally, introducing
a “down-projection” layer to reduce the state of a large RNN is typically useful
when the output contains a large number of elements. This projection layer reduces
the size of the final linear projection, as is often the case in language modeling
[MDB17].


7.6.7.1 Perplexity 


Perplexity is a measure of how well a model can represent the domain, shown by
its ability to predict a sample. For language models, perplexity can quantify the
language model’s ability to predict the validation or test data. The language model
performs well if it produces a high probability for a sentence in the test set.
Perplexity is the inverse probability normalized by the number of words.

We can define the perplexity measure for a test set of sentences $(s_1, \ldots, s_m)$
with:

\[
PP(s_1, \ldots, s_m) = 2^{-\frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)} \tag{7.48}
\]

where $M$ is the total number of words in the test set. Because perplexity gives the
inverse probability of the dataset, a lower perplexity implies a better result.
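Eq. (7.48) can be checked with a small worked example. The sketch assumes the language model has already produced a base-2 log probability for each test sentence; the function name and the toy numbers are illustrative only.

```python
import math

def perplexity(sentence_log2_probs, num_words):
    """Perplexity from per-sentence log2 probabilities and the total word count M."""
    return 2 ** (-sum(sentence_log2_probs) / num_words)

# A model that assigns probability 1/8 to every word, evaluated on two
# sentences containing 3 and 2 words respectively (M = 5).
log2_probs = [3 * math.log2(1 / 8), 2 * math.log2(1 / 8)]
pp = perplexity(log2_probs, num_words=5)
```

For this uniform model the perplexity equals 8, the number of equally likely choices per word, which matches the reading of perplexity as an effective branching factor.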


7.6.7.2 Recurrent Variational Autoencoder 


Recurrent variational autoencoders (RVAE) are an extension of recurrent language
models [KW13, RM15]. The goal of an RVAE is to incorporate variational inference
in the training process of the autoencoder to capture global features in latent
variables. In Bowman et al. [Bow+15], the authors utilized a VAE architecture to
generate sentences from a language model.


7.6.8 Neural Machine Translation 


Machine translation has been one of the largest beneficiaries of the success of
recurrent neural networks. Traditional approaches were based around statistical
models that were computationally expensive and required heavy domain expertise to
tune them. Machine translation is a natural fit for RNNs because input sentences
may differ in length and order from the desired output. Early architectures for
neural machine translation (NMT) relied on a recurrent encoder-decoder
architecture. A very simple illustration of this is shown in Fig. 7.19.

NMT takes an input sequence of words $X = (x_1, \ldots, x_m)$ and maps them to an 
output sequence $Y = (y_1, \ldots, y_n)$. Note that $n$ is not necessarily the same as $m$. 
Using an embedding space, input $X$ is mapped to a vector representation that is utilized 
by a recurrent network to encode the sequence. A decoder then uses the final RNN 
hidden state (the encoded input) to predict the translated sequence of words, $Y$ (some 
have also shown success with subword translation [DN17]). 

Fig. 7.19: Diagram of a one hidden layer encoder-decoder neural machine translation 
architecture. Note how the input and output sequences can have different 
lengths, and are truncated when the end of sentence (<EOS>) tag is reached 


It is often beneficial to reinforce the sequence as it is being predicted by the 
decoder network. Passing the predicted output sequence as input, as shown in 
Fig. 7.20, can improve predictions. During training, the ground truth can be passed 
as the input into the next time step at some frequency. This is referred to as “teacher 
forcing,” because it uses the true targets to guide the model during training. The 
alternative is to use the decoder's predicted output, which can cause difficulty in 
converging in the early stages of training. Teacher forcing is phased out as training 
continues, allowing the model to learn the appropriate dependencies. Scheduled sampling 
is a way to manage this transition by switching between feeding the targets and feeding 
the network output. 
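
The choice between the target and the model's own prediction can be sketched as below. The linear decay schedule and the helper names `teacher_forcing_prob` and `next_decoder_input` are assumptions made for illustration; other decay schedules are equally possible.

```python
import random


def teacher_forcing_prob(epoch, num_epochs, start=1.0, end=0.0):
    """Linearly decay the probability of feeding the ground truth (assumed schedule)."""
    frac = min(epoch / max(num_epochs - 1, 1), 1.0)
    return start + (end - start) * frac


def next_decoder_input(target_token, predicted_token, p_teacher, rng=random):
    """Feed the ground-truth token with probability p_teacher, else the prediction."""
    return target_token if rng.random() < p_teacher else predicted_token


# Early in training the decoder mostly sees the targets; later it mostly
# sees its own output.
for epoch in (0, 5, 9):
    p = teacher_forcing_prob(epoch, num_epochs=10)
    print(epoch, round(p, 2), next_decoder_input("chat", "chien", p))
```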
Fig. 7.20: A one hidden layer encoder-decoder neural machine translation architecture 
with the previous word of the prediction being used as the input at the following 
time step. Note that the embedding matrix has entries for both languages 

In practice, the encoder and decoder do not need to be more than 2-4 layers deep, 
and bidirectional encoders usually outperform unidirectional ones [Bri+17]. 


7.6.8.1 BLEU 


The most common metric used to evaluate machine translation is BLEU. BLEU 
(bilingual evaluation understudy) is a quality evaluation metric for machine trans- 
lation, designed to align with human evaluations of natural language. It allows a 
translation to be compared with a set of target translations to evaluate the quality. 


The score is bounded between 0 and 1, with higher values indicating better 
performance. Often in the literature, the score is multiplied by 100 to approximate a 
percentage correlation. At its core the BLEU score is a precision measurement. It 
computes the precision for reference n-grams in the targets. 


A perfect match would look like the following: 


from nltk.translate.bleu_score import sentence_bleu

targets = [['i', 'had', 'a', 'cup', 'of', 'black', 'coffee', 'at', 'the', 'cafe']]
prediction = ['i', 'had', 'a', 'cup', 'of', 'black', 'coffee', 'at', 'the', 'cafe']
score = sentence_bleu(targets, prediction) * 100
print(score)
> 100.0


Alternatively, if none of the reference words are present in the prediction, then 
we get a score of 0. 


from nltk.translate.bleu_score import sentence_bleu

targets = [['i', 'had', 'a', 'cup', 'of', 'black', 'coffee', 'at', 'the', 'cafe']]
prediction = ['what', 'are', 'we', 'doing']
score = sentence_bleu(targets, prediction) * 100
print(score)
> 0

If we change one or two words in the prediction, then we see a drop in the score. 


from nltk.translate.bleu_score import sentence_bleu

targets = [['i', 'had', 'a', 'cup', 'of', 'black', 'coffee', 'at', 'the', 'cafe']]
prediction = ['i', 'had', 'a', 'cup', 'of', 'black', 'tea', 'at', 'the', 'cafe']
score = sentence_bleu(targets, prediction) * 100
print(score)
> 65.8037

targets = [['i', 'had', 'a', 'cup', 'of', 'black', 'coffee', 'at', 'the', 'cafe']]
prediction = ['i', 'had', 'a', 'cup', 'of', 'black', 'tea', 'at', 'the', 'house']
score = sentence_bleu(targets, prediction) * 100
print(score)
> 58.1431


These examples call sentence_bleu with its default weights, which combine the 1-gram 
through 4-gram precisions (BLEU-4) rather than the unigram precision alone (BLEU-1). 
Higher-order n-grams give a better indicator of quality. BLEU-4 is commonly seen in 
NMT, giving the correlation when considering up to 4-gram precision between the 
hypothesis and the target translation. 
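
To make the precision computation concrete, the sketch below computes clipped n-gram precisions by hand for a hypothesis that differs from the reference in one word. It assumes a single reference and equal sentence lengths (so the brevity penalty is 1); the helper names are illustrative and this is not NLTK's implementation.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Fraction of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0


reference = "i had a cup of black coffee at the cafe".split()
hypothesis = "i had a cup of black tea at the cafe".split()

precisions = [modified_precision(reference, hypothesis, n) for n in range(1, 5)]
# Geometric mean of the 1- to 4-gram precisions (uniform weights).
bleu4 = (precisions[0] * precisions[1] * precisions[2] * precisions[3]) ** 0.25
print(precisions)
print(round(100 * bleu4, 4))
```

The four precisions are 9/10, 7/9, 5/8, and 3/7: a single wrong word costs one unigram but four of the seven 4-grams, which is why the combined score falls much further than the unigram precision alone.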


7.6.9 Prediction/Sampling Output 


There are a variety of ways to decode an output sequence from the predictions of a language model. 


7.6.9.1 Greedy Search 


If we predict the most likely word at each step, we may not obtain the best overall 
sequence probability. The best decision early in the process may not maximize the 
overall probability of the sequence. In fact there is a decision tree of possibilities 
to be decoded for the best possible outcome. Because of the tree-like structure in 
language model outputs, there are a variety of methods to parse them. 
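
A toy two-step example illustrates how the greedy choice can miss the most probable sequence. The probability tables are invented for illustration and do not come from a trained model.

```python
# Hypothetical next-word distributions for a two-word sequence.
first = {"the": 0.6, "a": 0.4}
second = {
    "the": {"cat": 0.4, "dog": 0.3, "bird": 0.3},
    "a": {"cat": 0.9, "dog": 0.1},
}


def greedy():
    """Pick the most likely word at each step."""
    w1 = max(first, key=first.get)
    w2 = max(second[w1], key=second[w1].get)
    return (w1, w2), first[w1] * second[w1][w2]


def exhaustive():
    """Score every path through the tree and keep the best one."""
    candidates = [((w1, w2), first[w1] * p2)
                  for w1 in first
                  for w2, p2 in second[w1].items()]
    return max(candidates, key=lambda c: c[1])


print(greedy())
print(exhaustive())
```

Greedy search commits to "the" (0.6) and ends with a sequence probability of about 0.24, while the sequence beginning with the less likely "a" reaches about 0.36.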


7.6.9.2 Random Sampling and Temperature Sampling 


Another way we can parse the output of our model is by using a random search. In 
a random search, the next word in the sentence is chosen according to the probabil- 
ity distribution of the next state. The random sampling technique can help achieve 
diversity in the results. However, sometimes the predictions for language models 
can be very confident, making the output results look similar to the greedy search 
results. A common way to improve the diversity of the predictions is to use a con- 
cept called temperature. Temperature is a method that exponentially transforms the 
probabilities and renormalizes to redistribute the highest probabilities among the top 
classes. 

One method of sampling from the language model is to use “temperature sampling.” 
This method selects an output prediction by applying a freezing function, 
defined by: 

$$f_\tau(p)_i = \frac{p_i^{1/\tau}}{\sum_j p_j^{1/\tau}} \tag{7.49}$$

where $\tau \in (0, 1]$ is the temperature parameter that controls how “warm” the 
predictions are. The lower the temperature, the less diverse the results. 
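
The freezing function of Eq. (7.49) can be sketched in a few lines. The function names and the example distribution are assumptions for illustration; for very small temperatures a practical implementation would work with log probabilities to avoid underflow.

```python
import random


def apply_temperature(probs, tau):
    """Raise each probability to the power 1/tau and renormalize."""
    scaled = [p ** (1.0 / tau) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample(probs, tau=1.0, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    weights = apply_temperature(probs, tau)
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.2]
print(apply_temperature(probs, 1.0))  # unchanged distribution
print(apply_temperature(probs, 0.5))  # mass shifts towards the top class
print(sample(probs, tau=0.5))
```

With a temperature of 0.5 the probabilities are squared before renormalizing, so the most likely class gains mass and sampled outputs move closer to the greedy result.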

Another desirable quality for NLP is language generation. In Sutskever et al. 
[SVL14b], a deep RNN-based encoder-decoder architecture is used to generate 
unique sentences. The network encodes a “source” word sequence into a fixed-length 
“encoding” vector via an RNN. The decoder uses the “encoding” as the initial hidden 
state, and produces the response. 


7.6.9.3 Optimizing Output: Beam Search Decoding 


Greedy search makes an independence assumption between each time step for the 
decoding. We are relying on our RNNs to correctly inform the dependency between 
each time step. We can provide a prior to our predictions to ensure that we avoid 
simple errors (such as conjugation). We can do this by biasing our prediction on 
a scoring mechanism that informs whether a particular sequence is more probable 
than another. 

When using our trained model to predict on new data, we are relying on the 
model to produce the correct output given the most confident prediction. However, 
in many situations it is desirable to impose a prior on the output, biasing it towards 
a particular domain. For example, in speech recognition an acoustic model’s perfor- 
mance can be greatly improved by incorporating a language model as the bias for 
the output predictions. 

The output that we are obtaining from our machine translation model, for exam- 
ple, is a probability distribution over the vocabulary at each time step, thus creating 
a tree of possibilities for how we could parse the output. 

Often it is too computationally expensive to explore the entire tree of possibil- 
ities, so the most common search method is the beam search. Beam search is a 
searching algorithm that keeps a fixed maximum on the number of possible states 
in memory. This provides a flexible approach to optimizing the output of a network 
after it is trained, balancing speed and quality. 

If we consider the output sequence of our network $(y_1, \ldots, y_m)$ where $y_i$ is a 
softmax output over our vocabulary, then we can compute the probability of the 
overall sequence with the product of the probabilities at each time step: 


$$P(y_1, \ldots, y_m) = \prod_{i=1}^{m} p_i(y_i) \tag{7.50}$$


We can decode it by conditioning our output on the probability of transitioning 
from one word to the next. 

If we have a language model that gives us the probability of a sequence of words, 
we can use this model to bias the prediction of our output by computing the proba- 
bility of the different paths that can be taken through the tree of possible transitions. 
Let $y$ be a sequence of words and $P(y)$ be the probability of that sequence according 
to our language model. We will use a beam search to explore multiple hypotheses 
of sequences at time $t$, $H_{t-1}$, with a beam size of $k$. 


$$H_t := \{(w_1^1, \ldots, w_t^1), \ldots, (w_1^k, \ldots, w_t^k)\}$$
$$H_3 := \{(\text{cup of tea}), (\text{cup of coffee})\}$$


With beam search we keep track of our top $k$ hypotheses, and choose the path 
that maximizes $P(y)$. We will collect the probability of each hypothesis $P(h_i)$ in $P_t$. 
The index order of $H_t$ and $P_t$ should be tied to keep them in sequence when sorting. 
We begin each hypothesis with the <SOS> token and end the hypothesis once the 
<EOS> token is reached. The hypothesis with the highest score is the one that is 
selected. 


Algorithm 1: Beam Search 

Data: y, beamWidth 
Result: y with highest p(y) 

begin 
    H_0 = {(<SOS>)} 
    P_0 = {0} 
    for t in 1 to T do 
        for h in H_{t-1} do 
            for ŷ ∈ Y do 
                y = (h_1, ..., h_{t-1}, ŷ) 
                H_t += y 
                P_t += P(y) 
        H_t = sort(H_t) according to highest P_t 
        H_t = H_t[1, ..., beamWidth] 
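

Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: scores are summed log probabilities (so the initial score is 0), the callable `step_log_probs` stands in for a trained decoder, and the toy probability table is invented. Length normalization, which many implementations add, is omitted.

```python
import math


def beam_search(step_log_probs, beam_width, max_len, sos="<sos>", eos="<eos>"):
    """Keep the beam_width best partial hypotheses at every time step.

    step_log_probs(prefix) returns {token: log probability} for the next token.
    """
    beams = [((sos,), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for hyp, score in candidates[:beam_width]:
            # Hypotheses that emit <eos> are complete and leave the beam.
            (finished if hyp[-1] == eos else beams).append((hyp, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])


# Hypothetical next-token probabilities; any other prefix ends the sentence.
TABLE = {
    ("<sos>",): {"the": 0.6, "a": 0.4},
    ("<sos>", "the"): {"cat": 0.4, "dog": 0.3, "bird": 0.3},
    ("<sos>", "a"): {"cat": 0.9, "dog": 0.1},
}


def toy_model(prefix):
    probs = TABLE.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_model, beam_width=1, max_len=5))
print(beam_search(toy_model, beam_width=2, max_len=5))
```

A beam width of 1 reduces to greedy search and returns the "the cat" hypothesis, while a width of 2 keeps the "a" branch alive long enough to find the more probable "a cat".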


7.7 Case Study 


Here, we apply the concepts of recurrent neural networks for neural machine transla- 
tion. Specifically, basic RNN, LSTM, GRU, and transformer sequence-to-sequence 
architectures are explored with an English-to-French translation task. We begin 
with an exploratory process of generating a dataset for the task. Next we explore 
sequence-to-sequence architectures, comparing the effects of various hyperparame- 
ters and architecture designs on quality. 

The dataset we use is a large set of English sentences with French translations 
from the Tatoeba website. The original data is a raw set of paired examples with no 
designated train, val, and test splits, so we create these during the EDA process. 


7.7.1 Software Tools and Libraries 


The popularity and diversity of problems sequence-to-sequence models can solve 
have led to many high-performance implementations. In this case study, we focus 
on the PyTorch-based Fairseq(-py) repository [Geh+17a], produced by Facebook 
AI Research (FAIR). This library holds implementations of many of the common 
seq-to-seq patterns with optimized dataloaders and batch support. 

Additionally, we use the PyTorch text package and spaCy [HM17] to perform 
EDA and data preparation. These packages provide many useful functions for text 
processing and dataset creation, specifically with a focus on deep learning data load- 
ers (although we do not use them here). 


7.7.2 Exploratory Data Analysis 


The raw format of the text contained in the Tatoeba dataset is a tab-separated En- 
glish sentence followed by the French translation, with one pair per line. Counting 
the number of lines gives us a total of 135,842 English-French pairs. By selecting 
a few random samples, as in Fig. 7.21, we can see that it contains punctuation, capi- 
talization, as well as unicode characters. Unicode should come as no surprise when 
considering a translation task; however, it must be considered when dealing with 
any computational representation due to variations in libraries and their support for 
unicode characters. 


Cheers!                 Santé ! 
I want to join you      Je veux me joindre à vous 
I was busy cooking      J'étais occupée à faire la cuisine 


Fig. 7.21: Examples from the English-French dataset 
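
A minimal sketch of reading the tab-separated pairs is shown below. The function name `read_pairs` and the file name in the usage comment are hypothetical; the file is opened explicitly as UTF-8 so that accented characters are decoded consistently across platforms.

```python
import csv


def read_pairs(path):
    """Read tab-separated (English, French) sentence pairs, one pair per line."""
    pairs = []
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks in the sentences as literal text.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                pairs.append((row[0], row[1]))
    return pairs


# Usage (hypothetical file name):
# pairs = read_pairs("fra.txt")
# print(len(pairs), pairs[0])
```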


7.7.2.1 Sequence Length Filtering 


First, we inspect the sequence lengths in the dataset. We use spaCy for both English 
and French tokenization. The tokenizers can be attached to torchtext fields so that 
they are applied automatically when the data is read. Fields in torchtext 
are generic data types for a dataset. In our example, there are two fields: 
a source field, “SRC,” which describes how the English sentences should be 
processed, and a target field, “TRG,” which describes the French data and its type 
handling. We can attach a tokenizer to each as follows. 
import spacy
from torchtext.data import Field  # legacy torchtext API (torchtext.legacy in 0.9+)

# The spaCy model names are assumptions; load whichever English and French
# models are installed locally.
spacy_en = spacy.load('en_core_web_sm')
spacy_fr = spacy.load('fr_core_news_sm')

def tokenize_fr(text):
    """
    Tokenizes French text from a string into a list of strings
    """
    return [tok.text for tok in spacy_fr.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_en, init_token='<sos>',
            eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_fr, init_token='<sos>',
            eos_token='<eos>', lower=True)

SRC.build_vocab(train_data, min_freq=0)
TRG.build_vocab(train_data, min_freq=0)


Torchtext can take a tokenizer of any type, as it is just a function that operates 
on the text that is passed in. The tokenizers in spaCy are useful as they have stop 
words, token exceptions, and various types of punctuation handling. 

Another consideration when training sequence-based models is the length of the 
examples. We plot a histogram of the sequence lengths in Fig. 7.22. The longer 
sentences are likely to hold a more complex structure, and likely have longer range 
dependencies. We would not expect to learn these examples, as they are 
underrepresented in the dataset. If we desire to learn translation for longer sentences, 
we would have to collect more data, or intelligently break long examples into shorter 
ones, where we have more data. Additionally, long examples can lead to memory 
concerns with mini-batch training, as the batch size can be larger with shorter 
examples. 

Fig. 7.22: Histogram of sentence lengths for both English and French. Notice that a 
majority of the sentences are short, and there are very few long sentences 

For this case study, we remove longer examples by setting a threshold on the 
length of our examples. We select a limit of 20 time steps on the input or output 
sequence, which allows for a maximum of 18 actual words in the sequences after 
incorporating the <sos> and <eos> tokens. This restriction means our max 
length incorporates all of the sequence lengths that have significant data. The 
resulting length distribution is shown in Fig. 7.23. 

Fig. 7.23: Histogram of sentence lengths for both English and French after filtering 
the max length to 18 (20 if we include <sos> and <eos> tokens) 
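
The length threshold can be sketched as a simple filter over tokenized pairs. Whitespace splitting stands in for the spaCy tokenizers here, and the names `MAX_WORDS` and `within_length` are illustrative assumptions.

```python
MAX_WORDS = 18  # 20 time steps minus the <sos> and <eos> tokens


def within_length(src_tokens, trg_tokens, max_words=MAX_WORDS):
    """Keep a pair only if both sides fit within the word limit."""
    return len(src_tokens) <= max_words and len(trg_tokens) <= max_words


pairs = [
    ("i want to join you .".split(), "je veux me joindre à vous .".split()),
    (["word"] * 25, ["mot"] * 25),  # artificially long example
]
filtered = [pair for pair in pairs if within_length(*pair)]
print(len(pairs), len(filtered))
```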


After filtering the longer examples, we create our training, validation, and testing 
splits without replacement, using the index-shuffling technique shown below. 


import random

n_examples = len(all_data)
idx_array = list(range(n_examples))
random.shuffle(idx_array)
train_indexs = idx_array[:int(0.8*n_examples)]  # 80% training data
val_indexs = idx_array[int(0.8*n_examples):int(0.9*n_examples)]  # 10% validation data
test_indexs = idx_array[int(0.9*n_examples):]  # 10% testing data


This technique should provide each split of the dataset with similar characteristics. 
The final dataset allocates 80% for training, 10% for validation, and 10% for testing. 
We save the data into files so that they can be used in other experiments if desired, 
without having to repeat all the preprocessing. Inspecting the resulting data splits 
shows a similar length distribution for each, as depicted in Fig. 7.24. 


7.7.2.2 Vocabulary Inspection 


The vocabulary object offers many common NLP functions, such as indexed access 
to the terms, simplified embedding creation, and frequency filtering. 
We now load and tokenize the data splits. The overall vocabulary size is: 


train_data, valid_data, test_data = FrenchTatoeba.splits(
    path=data_dir,
    exts=('.en', '.fr'),
    fields=(SRC, TRG))

SRC.build_vocab(train_data, min_freq=0)
TRG.build_vocab(train_data, min_freq=0)

print("English vocabulary size:", len(SRC.vocab))
print("French vocabulary size:", len(TRG.vocab))

> English vocabulary size: 12227
> French vocabulary size: 20876


Fig. 7.24: Histogram of sentence lengths for the (a) training data, (b) validation data, 
and (c) testing data 


The vocabulary frequencies are shown in Fig. 7.25. The distribution displays a “long 
tail” effect, where a small subset of tokens have high counts, for example “.”, which 
occurs in almost all sentences, while other tokens are seen just once, for example 
“stitch.” In the most extreme case of a word only appearing once, training 
relies solely on that single example to inform the model, likely leading to overfitting. 
Additionally, the distribution for the softmax will assign some probability to 
these terms. As the infrequent words occupy the majority of the vocabulary, much 
of the probability mass will be assigned to these terms in the early stages, slowing 
learning. A common approach is to map infrequent words to the unknown token, 
<unk>. This allows the model to ignore a likely invalid representation of an 
underrepresented set of terms. We can enforce a minimum frequency by setting it as 
an argument when building the vocabulary. 

Fig. 7.25: Unfiltered word frequency for (a) English and (b) French. The counts 
were sorted and placed on a log scale to capture the severity of the word 
representations in this dataset. As we can see, there are many words that are used rarely, 
while a small subset is used frequently 

The training dataset is used to create the vocabulary (using the validation data 
is considered data snooping). We set a minimum frequency of 5 during vocabulary 
creation. Evaluating the effects of this parameter is left as an exercise. 


SRC.build_vocab(train_data, min_freq=5)
TRG.build_vocab(train_data, min_freq=5)
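

The effect of a minimum frequency can be illustrated without torchtext. The helpers `build_vocab` and `encode` below are hypothetical stand-ins written for this sketch (they are not the torchtext implementation), and the three-sentence corpus is invented.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map tokens seen at least min_freq times to indices; specials come first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


def encode(tokens, vocab):
    """Replace out-of-vocabulary tokens with the index of <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


train = [
    ["i", "like", "tea", "."],
    ["i", "like", "coffee", "."],
    ["i", "stitch", "."],
]
vocab = build_vocab(train, min_freq=2)
print(vocab)
print(encode(["i", "stitch", "."], vocab))
```

With a threshold of 2, the words seen only once ("tea", "coffee", "stitch") are dropped from the vocabulary and encoded as the unknown token.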


When investigating the final vocabulary, we still notice that there is a long tail 
distribution for the frequency of the words, shown in Fig. 7.26. This shouldn't be too 
much of a surprise, given that we chose a minimum frequency of 5. If the threshold 
is too high, removing many words, then the model becomes too restricted in its 
learning, with many values mapping to the unknown token. 

Fig. 7.26: Term frequency graph for the filtered vocabulary of the (a) English and 
(b) French training data 

Figure 7.27 shows the top 50 terms for English and French in the training set. An 
analysis of the list leads to some interesting questions about the data. For example, 
one of the most common words shown in the vocabulary list is the word “n't.” This 
seems odd since there is no word “n't” in the English language. A deeper inspection 
reveals that spaCy tokenization splits contractions in this way, leaving an isolated 
token “n't” whenever a contraction such as “don't” or “can't” appears. The same 
situation occurs when the contraction “I'm” is processed. This illustrates the importance 
of iterative improvements on data, as preprocessing is a fundamental component of 
the feature generation, as is post-processing if results are computed on the final 
output. 

Fig. 7.27: Frequency counts for the top 20 terms from the training set for (a) English 
and (b) French 


The final counts of our data splits are shown below. 


Training set size: 107885
Validation set size: 13486
Testing set size: 13486
Size of English vocabulary: 4755
Size of French vocabulary: 6450



7.7.3 Model Training 


Now that the dataset is ready, we investigate models and their performance on the 
training and validation sets. Specifically, we focus on various simple RNNs, LSTMs, 
and GRUs. Each of these architectures is investigated with respect to learning rate, 
depth, and bidirectionality. Each technique involves optimization of multiple 
hyperparameters to regularize the network, while also changing the training dynamics 
of the network. To avoid a full grid search over all possible hyperparameters, 
we tune only the learning rate for each architecture introduced. This does not completely 
remove the need to tune other parameters, but it makes the problem tractable. 

Each model we train utilizes the script shown in Fig. 7.28. Note GRU and RNN 
configurations are not implemented in fairseq. We added these to the library for the 
purposes of this comparison. 

python train.py datasets/en-fr \
    --arch {rnn_type} \
    --encoder-dropout-out 0.2 \
    --encoder-layers {n_layers} \
    --encoder-hidden-size 512 \
    --encoder-embed-dim 256 \
    --decoder-layers {n_layers} \
    --decoder-embed-dim 256 \
    --decoder-hidden-size 512 \
    --decoder-attention False \
    --decoder-dropout-out 0.2 \
    --optimizer adam --lr {lr} \
    --lr-shrink 0.5 --max-epoch 100 \
    --seed 1 --log-format json \
    --num-workers 4 \
    --batch-size 512 \
    --weight-decay 0


Fig. 7.28: Base training configuration for our fairseq model training. The RNN type, 
number of layers, and learning rate (lr) can be controlled by inserting the parameters 
appropriately 


Each model is trained for a maximum of 100 epochs. We reduce the learning 
rate when validation performance plateaus, and stop when learning plateaus. The 
embedding dimension is fixed at 256 and dropout for the input and output is set 
to 0.2. For simplicity, we fix the hidden size to 512 for all experiments (except 
for bidirectional architectures). A bidirectional encoder provides two hidden states to 
the decoder, and therefore the decoder size must double. Some may argue that 
comparability between models would only be achieved if the models have the same 
number of parameters. For example, LSTMs have roughly 4x the number of 
parameters as standard RNNs; however, for simplicity and clarity, we maintain a fixed 
hidden representation. In the following figures, each model name takes the form 
{rnn_type}_{lr}_{num_layers}_{metric}. 
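
The parameter ratio can be checked with a short calculation for one recurrent layer at the dimensions used here (embedding size 256, hidden size 512). The sketch assumes one weight matrix over the concatenated input and hidden state plus a single bias vector per gate; some implementations keep two bias vectors per gate, which changes the totals slightly but not the ratios.

```python
def recurrent_params(input_dim, hidden_dim, gates):
    """Weights and biases of one recurrent layer with `gates` gate/candidate blocks."""
    per_gate = hidden_dim * (input_dim + hidden_dim) + hidden_dim
    return gates * per_gate


embed_dim, hidden = 256, 512
rnn = recurrent_params(embed_dim, hidden, gates=1)   # single tanh update
gru = recurrent_params(embed_dim, hidden, gates=3)   # reset, update, candidate
lstm = recurrent_params(embed_dim, hidden, gates=4)  # input, forget, output, candidate
print(rnn, gru, lstm, lstm / rnn)
```

Under these assumptions the simple RNN layer has 393,728 parameters, the GRU 1,181,184, and the LSTM 1,574,912, so fixing the hidden size gives the gated architectures three to four times the capacity of the baseline.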


7.7.3.1 RNN Baseline 


First, we investigate the performance of a single layer, unidirectional RNN as a base- 
line for our experiments. We perform a manual grid search on the learning rate to 
find a reasonable starting value. The resulting validation curves for these selections 
are shown in Fig. 7.29. 

The validation curves show how much the learning rate impacts the capacity of 
the model for RNNs, giving drastically different learning curves. 

We also compute our testing results for this model to be used as a comparison 
at the end. Note that the test result is not used in any way to tune or improve our 
models; all tuning is done using the validation set. The testing result for our best 
RNN model is: 


Translated 13486 sentences:
Generate test with beam=1: BLEU4 = 15.46


Fig. 7.29: Validation loss for a single layer RNN with different learning rates on 
English-French translation 


7.7.3.2 RNN, LSTM, and GRU Comparison 


Next, we compare RNN, LSTM, and GRU architectures. We vary the learning rate 
for each, as the dynamics are likely different for each architecture. The validation 
results are shown below in Fig. 7.30. 

Upon inspection, we notice that some configurations take much longer to con- 
verge than others. In particular, with a learning rate of 0.0001, both the GRU and 
LSTM architectures reach the maximum 100 epochs. Secondly, we see that the 
LSTM and GRU architectures converge to lower losses, much faster, and with higher 
learning rates than RNN architectures. The GRU appears to be the best performing 
model here, but both the LSTM and GRU show similar convergence. 


7.7.3.3 RNN, LSTM, and GRU Layer Depth Comparison 


We now compare the effect of depth on each architecture. Here, we vary the depth 
configuration in addition to the learning rate for each architecture. The depths 
explored are 1, 2, and 4 layers. Results are displayed in Fig. 7.31. 

Now that we have many models, it becomes more difficult to draw general 
conclusions about their properties. If we compare the RNN models, we notice that many 
of the configurations converge to a much higher validation loss than either the GRU 
or LSTM architectures. We also observe that the deeper architectures tend to 
perform well with lower learning rates than their shallower counterparts. Additionally, 
both the LSTM and GRU architectures achieve their best models with a depth of 2 
layers and a learning rate of 0.001. 

Fig. 7.30: Comparison of single layer RNN, GRU, and LSTM networks on 
English-French translation 


7.7.3.4 Bidirectional RNN, LSTM, and GRU Comparison 


Next, we look at the effects of bidirectional models. Many of the models perform 
similarly. Figure 7.32b shows the perplexity of the model's predictions (ppl) instead 
of the validation loss. This value is $2^{loss}$, exaggerating the effects in the graph, which 
can be useful when visually inspecting curves. 

Once again we see that the LSTM and GRU architectures outperform the RNN 
architectures, with the GRU architecture performing slightly better. 


7.7.3.5 Deep Bidirectional Comparison 


So far, the best performing models have been the 2-layer LSTM and GRU models 
and the single layer bidirectional LSTM and GRU models. Here, we combine the 
two components to see if the benefits are complementary. In this set of experiments 
we remove the under-performing RNN models for clarity. The results are shown in 
Fig. 7.33. 

Fig. 7.31: Depth comparison for (a) RNN, (b) LSTM, and (c) GRU architectures 

Fig. 7.32: Comparison of (a) validation loss and (b) ppl for single layer, bidirectional 
RNN, GRU, and LSTM networks. Note that although the colors are similar, the top 
two lines are RNN models (not GRU models) 


This set of results shows that the 2-layer GRU architecture, with a learning rate 
of 0.001, is the best model in the bidirectional comparison. 


7.7.3.6 Transformer Network 


We now turn our attention to the transformer architecture, where attention is applied 
directly to the input sequence without incorporating recurrent networks. Similar to 
previous experiments, we fix the input and output dimensionality to 256, use 4 attention 
heads in both the encoder and decoder, and fix the size of the fully connected layers to 
512. We explore a small selection of depths and vary the learning rates accordingly. 
The results are shown in Fig. 7.34, with the 4-layer transformer architecture using a 
learning rate of 0.0005 performing the best. 


Fig. 7.33: Comparison of 2-layer bidirectional GRU and LSTM architectures 

Fig. 7.34: Comparison of transformer architectures with different learning rates and 
depths. Note the depth is the same for both the encoder and decoder 

Fig. 7.35: Comparison of best NMT models from previous trials 


7.7.3.7 Comparison of Experiments 


Having explored many types of architectures for machine translation, we now com- 
pare the outputs of each experiment. This set includes the best performing RNN 
from the baseline experiments, the single-layer unidirectional and bidirectional 
GRU, the 2-layer unidirectional and bidirectional GRU, and the 4-layer transformer 
network. Comparing the loss of these models on the validation set (Fig. 7.35), we see 
that the 4-layer transformer network is our best performer. 


7.7.4 Results 


We now compare the models from each experiment on the test set (Table 7.1). 


Table 7.1: NMT network performance on the test set. The best result is marked with an asterisk 


Network type                   Learning rate   BLEU4 
Baseline RNN (1 layer)         0.0005          15.46 
GRU, 1-layer                   0.001           36.17 
GRU, 2-layer                   0.001           38.53 
GRU, 1-layer, bidirectional    0.005           40.63 
GRU, 2-layer, bidirectional    0.001           40.60 
Transformer, 4-layer           0.0005          44.07* 

When we sample outputs from the model (Fig. 7.36) we see that the results look 
pretty good. Notice how the model may produce reasonable translations even though 
it may not predict the target exactly. 


Input: are you surprised ? 
Target: êtes - vous surpris ? 
Hypothesis: êtes - vous surprises ? 


Input: i have evidence . 
Target: j' ai des preuves . 
Hypothesis: je dispose de preuves . 


Input: i do n't know how many more times i 'll be able to do this . 
Target: j' ignore combien de fois je serait encore capable de faire ça . 
Hypothesis: je ne sais pas combien de fois je serai capable de faire ça . 


Fig. 7.36: Output from the best performing NMT model 


In conclusion, we have shown that for our task it is almost always preferable to 
use GRU or LSTM architectures over base RNNs. Additionally, we have shown that 
the initial learning rate has a significant impact on a model’s quality, even when us- 
ing adaptive learning rate methods. Furthermore, the learning rate needs to be tuned 
for each configuration of the model given the dynamic nature of the deep networks. 
Lastly, deeper networks are not always better. On this dataset, the 2-layer recurrent 
architectures outperformed 4-layer counterparts. And a single-layer, bidirectional 
GRU showed marginal improvements over the 2-layer counterpart on the final test- 
ing set, even though it performed slightly worse on the validation loss comparison. 
These results show the importance of tuning hyperparameters for not only the ap- 
plication, but also the dataset. In real-world applications, it is recommended to tune 
as many hyperparameters as possible to achieve the best result. 


7.7.5 Exercises for Readers and Practitioners 


Other interesting problems for readers and practitioners include: 


1. Add L2 regularization to the training and see if it improves generalization on 
the testing set. 

2. Prune the vocabulary, to remove more infrequent terms (for example, words 
that appear fewer than 20 times). What effect does this have on the training 
(performance, quality)? 

3. Tune the beam search parameter to the validation dataset. What effect does this 
have on the test data? What is the effect on prediction time? 

4. Experiment with tuning other hyperparameters in encoder and decoder. 

5. What would need to be changed to modify the architecture for the question 
answering task? 

6. Initialize the network with pre-trained embeddings.




7.8 Discussion 


The results from RNNs on many NLP tasks are quite impressive, achieving state- 
of-the-art results in almost every area. Their effectiveness is remarkable given their 
simplicity. However, in practice, real-world settings require additional considerations,
such as small datasets, lack of diversity in data, and generalization. What follows
is a short discussion of these concerns and the common debates that arise.


7.8.1 Memorization or Generalization 


All of the deep learning techniques that have been discussed so far come with the 
risk of overfitting. Additionally, many academic NLP benchmarks
are heavily focused on a particular problem with ample data that may not represent a
real-world task.° The correlations between training and testing data allow some level 
of overfitting to be advantageous to both the validation and testing sets; however, it 
is arguable whether or not these correlations are just representative of the domain 
itself. The difficulty is knowing whether the network is memorizing certain
sequences that significantly lower the overall cost, or learning correlations in the
underlying semantic structure of the problem. Some of the symptoms of memorization
are illustrated in the need for decoding algorithms such as beam search and random 
selection with temperature to produce variety in the output sequences. 
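
As a rough illustration of random selection with temperature, the sketch below rescales a vector of output scores before sampling. The function names and the example scores are hypothetical and are not taken from the case study code; low temperatures concentrate probability on the top-scoring token, while high temperatures flatten the distribution and increase variety.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index at random according to the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


scores = [2.0, 1.0, 0.0]  # invented scores for three candidate tokens
sharp = softmax_with_temperature(scores, 0.5)
flat = softmax_with_temperature(scores, 2.0)
token = sample_token(scores, 1.0, random.Random(0))
```

With these scores, the best token receives more probability mass at temperature 0.5 than at temperature 2.0.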

In Ref. [Gre16], Grefenstette explored the question of whether or not recurrent
networks are capable of learning pushdown automata, which is arguably the simplest
form of computation required for natural language. This work cites some of
the limitations of “simple RNNs” as: 


• Non-adaptive capacity 
• Target sequence modeling dominates training 
• Gradient-starved encoder. 


The suggestion, focused specifically on simple RNNs, was that RNNs are arguably 
only capable of learning finite state machines. 

In Liska et al. [LKB18], the authors studied the ability of RNNs to learn a compositional
structure, which would show an RNN’s ability to transfer learning from one 
task to another. A small number of the RNNs in the experiment showed that it was 
possible to learn compositional solutions without architectural constraints, although 
many of the RNN attempts were not successful. The results achieved show that gra- 
dient descent and evolutionary strategies may be a compelling direction for learning 
compositional structures. 


> This is not to say that academic benchmarks are not relevant, but rather to point out the importance 
of domain and technological understanding for domain adaptation. 


7.8.2 Future of RNNs 


One suggestion from Grefenstette’s presentation [Gre16] was to treat recurrence as 
an API. We have seen indications of this suggestion in this chapter already with 
LSTM and GRU cells. In those examples the recurrence API only needs to sat- 
isfy the interaction: given an input and previous state produce an output and up- 
dated state. This abstraction paves the way for a variety of memory-based architec- 
tures such as dynamic memory networks [XMS16] and the stack-LSTM [Dye+15]. 
Future directions point towards adding stacks and queues to have a more interactive 
memory model similar to RAM with architectures such as neural Turing machines 
[GWD14a]. 
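
The idea of recurrence as an API can be sketched in a few lines. The type alias, the toy cell, and the `unroll` helper below are hypothetical names chosen for illustration; the only requirement they encode is the one stated above: given an input and the previous state, produce an output and an updated state.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (input, previous state) -> (output, updated state).
State = float
Cell = Callable[[float, State], Tuple[float, State]]


def running_mean_cell(decay: float) -> Cell:
    """Toy cell whose state is an exponential moving average of its inputs."""
    def step(x: float, prev_state: State) -> Tuple[float, State]:
        new_state = decay * prev_state + (1.0 - decay) * x
        return new_state, new_state  # here the output equals the new state
    return step


def unroll(cell: Cell, inputs: Sequence[float], init_state: State) -> Tuple[List[float], State]:
    """Run any cell that satisfies the interface over a whole sequence."""
    outputs: List[float] = []
    state = init_state
    for x in inputs:
        y, state = cell(x, state)
        outputs.append(y)
    return outputs, state


outputs, final_state = unroll(running_mean_cell(0.5), [1.0, 1.0, 1.0], 0.0)
```

Because `unroll` depends only on the interface, the toy cell could in principle be swapped for an LSTM step, a GRU step, or a memory-augmented cell without changing the loop.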
Chapter 8 
Automatic Speech Recognition 





8.1 Introduction 


Automatic speech recognition (ASR) has grown tremendously in recent years, with 
deep learning playing a key role. Simply put, ASR is the task of converting spoken 
language into computer readable text (Fig. 8.1). It has quickly become ubiquitous 
today as a useful way to interact with technology, significantly bridging the gap 
in human-computer interaction and making it more natural. Historically, ASR is tightly 
coupled with computational linguistics, given its close connection with natural lan- 
guage, and phonetics, given the variety of speech sounds that can be produced by 
humans. This chapter introduces the fundamental concepts of speech recognition 
with a focus on HMM-based methods. 


[Figure: raw speech signal → transcription]


Fig. 8.1: The focus of ASR is to convert a digitized speech signal into computer 
readable text, referred to as the transcript 


Simply put, ASR can be described as follows: given an input of audio samples X 
from a recorded speech signal, apply a function f to map it to a sequence of words 
W that represent the transcript of what was said. 


W = f(X) (8.1) 


However, finding such a function is quite difficult, and requires consecutive model- 
ing tasks to produce the sequence of words. 
These models must be robust to variations in speakers, acoustic environments, 
and context. For example, human speech can have any combination of time variation 
(speaker speed), articulation, pronunciation, speaker volume, and vocal variations 
(raspy or nasally speech) and still result in the same transcript. 

Linguistically, additional variables are encountered, such as prosody (rising
intonation when asking a question), mannerisms, and spontaneous speech with
filler words (“um”s or “uh”s). All of these can imply different emotions or implications,
even though the same words are spoken. Combining these variables with any number
of environmental scenarios such as audio quality, microphone distance, background
noise, reverberation, and echoes exponentially increases the complexity of 
the recognition task. 

The topic of speech recognition can include many tasks such as keyword spotting, 
voice commands, and speaker verification (security). In the interest of concision, we 
focus mainly on the task of speech-to-text (STT), specifically, large vocabulary con- 
tinuous speech recognition (LVCSR) in this chapter. We begin by discussing error 
metrics commonly used for ASR systems. Next, we discuss acoustic features and 
processing, as well as phonetic units used for speech recognition. These concepts 
are combined as we introduce statistical speech recognition, the classical approach 
to ASR. We then introduce the DNN/HMM hybrid model, showing how the classi- 
cal ASR pipeline incorporates deep learning. At the end of the chapter, a case study 
compares two common ASR frameworks. 


8.2 Acoustic Features 


The selection of acoustic features for ASR is a crucial step. Features extracted from 
the acoustic signal are the fundamental components for any model building as well 
as the most informative component for the artifacts in the acoustic signal. Thus, the 
acoustic features must be descriptive enough to provide useful information about 
the signal, as well as resilient enough to the many perturbations that can arise in the 
acoustic environment. 


8.2.1 Speech Production 


Let us first begin with a quick overview of how humans produce speech. While a 
full study of the anatomy of the human vocal system is beyond the scope of this 
book, some knowledge of human speech production can be helpful. The physical 
production of speech consists of changes in air pressure that produce compression 
waves that our ears interpret in conjunction with our brain. Human speech is created 
from the vocal tract and modulated with the tongue, teeth, and lips (often referred 
to as articulators): 


• Air is pushed up from the lungs and vibrates the vocal cords (producing
  quasi-periodic sounds). 
• The air flows into the pharynx, nasal, and oral cavities. 
• Various articulators modulate the waves of air. 
• Air escapes through the mouth and nose. 


Human speech is usually limited to the range 85 Hz to 8 kHz, while human hearing is 
in the range 20 Hz to 20 kHz. 


8.2.2 Raw Waveform 


The waves of air pressure produced are converted into a voltage via a microphone 
and sampled with an analog-to-digital converter. The output of the recording process 
is a 1-dimensional array of numbers representing the discrete samples from the dig- 
ital conversion. The digitized signal has three main properties: sample rate, number 
of channels, and precision (sometimes referred to as bit depth). The sample rate is 
the frequency at which the analog signal is sampled (in Hertz). The number of chan- 
nels refers to audio capture with multiple microphone sources. Single-channel audio 
is referred to as monophonic or mono audio, while stereo refers to two-channel au- 
dio. Additional channels such as stereo and multi-channel audio can be useful for 
signal filtering in challenging acoustic environments [BW13]. The precision or bit 
depth is the number of bits per sample, corresponding to the resolution of the infor- 
mation. 

Standard telephone audio has a sampling rate of 8 kHz and 16-bit precision. CD 
quality is 44.1 kHz, 16-bit precision, while contemporary speech processing focuses 
on 16 kHz or higher. 

Sometimes bit rate is used to measure the overall quality of audio computed by: 


bit rate = sample rate × precision × number of channels. (8.2) 
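
As a quick arithmetic check of Eq. (8.2), the sketch below evaluates the bit rate for the telephone and CD settings mentioned above. The function name is hypothetical, and the use of two channels for CD audio is an assumption made for the example.

```python
def bit_rate(sample_rate_hz: int, precision_bits: int, num_channels: int) -> int:
    """Bits per second: sample rate x precision x number of channels (Eq. 8.2)."""
    return sample_rate_hz * precision_bits * num_channels


telephone = bit_rate(8_000, 16, 1)    # 8 kHz, 16-bit, mono
cd_stereo = bit_rate(44_100, 16, 2)   # 44.1 kHz, 16-bit, assumed stereo
speech_16k = bit_rate(16_000, 16, 1)  # typical speech-processing setting
```

These evaluate to 128,000, 1,411,200, and 256,000 bits per second, respectively.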


The raw speech signal is high dimensional and difficult to model. Most ASR 
systems rely on features extracted from the audio signal to reduce the dimensionality 
and filter unwanted signals. Many of these features come from some form of spectral 
analysis that converts the audio signal to a set of features that emphasize the parts
of the signal to which the human ear responds. Many of these methods depend on
computing a short-time Fourier transform (STFT) on the audio signal using an FFT,
filter banks, or some combination of the two [PVZ13]. 


8.2.3 MFCC 


Mel frequency cepstral coefficients (MFCC) [DM90] are the most commonly used 
features for ASR. Their success relies on their low dimensionality and on their
ability to perform filtering that approximates the human auditory system. 

There are seven steps to computing the MFCC features [MBE10]. The overall 
process is shown in Fig. 8.2. These steps are similar for most feature generation 
techniques, with some variability in the types of filters that are used and the filter 
banks applied. We discuss each step individually: 


1. Pre-emphasis 

2. Framing 

3. Hamming windowing 

4. Fast Fourier transform 

5. Mel filter bank processing 

6. Discrete cosine transform (DCT) 

7. Delta energy and delta spectrum. 


Fig. 8.2: Diagram of MFCC processing with a visual representation for various parts 
of the process. All spectrograms and features are shown in log-space 


8.2.3.1 Pre-emphasis 


Pre-emphasis is the first step in MFCC feature generation. In speech production 
(and signal processing in general), the energy of higher frequency signals tends to 
be lower. Pre-emphasis processing applies a filter to the input signal that emphasizes 
the amplitudes of higher frequencies and lowers the amplitudes of lower frequency 
bands. For example: 


$$y_t = x_t - \alpha x_{t-1} \tag{8.3}$$


would make the output less dependent on a strong signal from the previous time 
step. 
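
A minimal sketch of the pre-emphasis filter in Eq. (8.3) follows. The default coefficient of 0.97 is an assumed, commonly used value rather than one prescribed here, and passing the first sample through unchanged is a design choice for the boundary.

```python
from typing import List, Sequence


def pre_emphasis(signal: Sequence[float], alpha: float = 0.97) -> List[float]:
    """Apply y_t = x_t - alpha * x_{t-1}; the first sample is passed through."""
    if not signal:
        return []
    out = [signal[0]]
    for t in range(1, len(signal)):
        out.append(signal[t] - alpha * signal[t - 1])
    return out


# A constant (zero-frequency) signal is almost entirely suppressed,
# while a rapidly alternating (high-frequency) signal is amplified.
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

After the first sample, the constant input is reduced to roughly 0.03 per sample, while the alternating input grows to a magnitude of roughly 1.97.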


8.2.3.2 Framing 


The acoustic signal is perpetually changing in speech. Modeling this changing signal 
is done by treating small segments sampled from the audio as stationary. Framing 
is the process of separating the samples from the raw audio into fixed length seg- 
ments referred to as frames. These segments are converted to the frequency domain 
with an FFT, yielding a representation of the strength of frequencies during each 
frame. The segments signify the boundaries between the phonetic representations 
of speech. The phonetic sounds associated with speech tend to be in the range of 
5 to 100 ms, so the length of frames is usually chosen to account for this. Typically, 
frames are around 20 ms long for most ASR systems, with a 10 ms overlap, yielding
a resolution of 10 ms for our frames. 
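
The framing step can be sketched as follows, assuming 20 ms frames advanced in 10 ms steps. The function name and the decision to drop a trailing partial frame (rather than pad it) are assumptions made for this example.

```python
from typing import List, Sequence


def frame_signal(samples: Sequence[float], sample_rate_hz: int,
                 frame_ms: float = 20.0, hop_ms: float = 10.0) -> List[List[float]]:
    """Split samples into fixed-length, overlapping frames."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):  # a trailing partial frame is dropped
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


one_second = [0.0] * 16_000               # 1 s of silence at 16 kHz
frames = frame_signal(one_second, 16_000)
```

At 16 kHz each frame holds 320 samples and consecutive frames start 160 samples apart, so one second of audio yields 99 full frames.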


8.2.3.3 Windowing 


Windowing multiplies the samples by a scaling function. The purpose of this func- 
tion is to smooth the potentially abrupt effects of framing that can cause sharp differ- 
ences at the edges of frames. Applying windowing functions to the samples there- 
fore tapers the changes to the segment to dampen signals near the edges of the frame 
that may have harsh effects after the application of the FFT. 

Many windowing functions can be applied to a signal. The most commonly used 
for ASR are Hann windowing and Hamming windowing. 

Hann window: 


$$w(n) = 0.5\left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right) = \sin^2\left(\frac{\pi n}{N-1}\right) \tag{8.4}$$


Hamming window: 


$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right) \tag{8.5}$$


where N is the window length and $0 \le n \le N-1$. 
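
The two window functions can be written directly from their definitions. This sketch assumes the symmetric form with N - 1 in the denominator and a window length of at least 2; the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def hann(N: int) -> List[float]:
    """Hann window: 0.5 * (1 - cos(2*pi*n / (N - 1))) for n = 0..N-1."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (N - 1))) for n in range(N)]


def hamming(N: int) -> List[float]:
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]


def apply_window(frame: Sequence[float], window: Sequence[float]) -> List[float]:
    """Scale each sample in the frame by the matching window value."""
    return [x * w for x, w in zip(frame, window)]


tapered = apply_window([1.0] * 5, hann(5))
```

The Hann window falls to zero at both edges of the frame, while the Hamming window only falls to 0.08, which is the main practical difference between the two.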


8.2.3.4 Fast Fourier Transform 


A short-time Fourier transform (STFT) converts the 1-dimensional signal from the 
time domain into the frequency domain by using the frames and applying a discrete 
Fourier transform (DFT) to each. An illustration of the DFT conversion is shown in 
Fig. 8.3. The fast Fourier transform (FFT) is an efficient algorithm to compute the 
DFT under suitable circumstances and is common for ASR. 





Fig. 8.3: The desired effect of an FFT on an input signal (shown on the left) and the 
normalized FFT output in the frequency domain (shown on the right) 


The spectrogram is a 3-dimensional visual FFT transformation of the acoustic 
signal and is often a valuable set of features itself. The STFT representation can 
be advantageous because it makes the fewest assumptions about the speech signal 
(aside from the raw waveform). For some end-to-end systems, the spectrogram is 
used as input, because it provides a higher resolution frequency description. The 
plot itself, shown in Fig. 8.4, has time along the x-axis, the frequency bins on the 
y-axis, and the intensity of that frequency in the z-axis, which is usually represented 
by the color. 

The magnitude spectrogram can be computed by: 


$$S_m = |\mathrm{FFT}(x_i)|^2 \tag{8.6}$$


The power spectrogram is sometimes more helpful because it normalizes the 
magnitude by the number of points considered:


$$S_p = \frac{|\mathrm{FFT}(x_i)|^2}{N} \tag{8.7}$$


where N is the number of points considered for the FFT computation (typically 256 
or 512). 


Fig. 8.4: Log spectrogram of an audio file 


Most of the significant frequencies are in the lower portion of the frequency 
spectrum, so the spectrogram is typically mapped into the log scale. 
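
To make the spectrogram computation concrete, the sketch below evaluates the power spectrum of Eq. (8.7) for a single frame and then takes the log. For clarity it uses a direct DFT, which is far slower than an FFT but produces the same values; the function names and the small floor added before the log are assumptions of this example.

```python
import cmath
import math
from typing import List, Sequence


def dft(frame: Sequence[float]) -> List[complex]:
    """Direct DFT, keeping the N // 2 + 1 non-redundant bins of a real signal."""
    N = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N // 2 + 1)]


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """|DFT(x)|^2 / N for each frequency bin, following Eq. (8.7)."""
    N = len(frame)
    return [abs(c) ** 2 / N for c in dft(frame)]


def log_power_spectrogram(frames: Sequence[Sequence[float]],
                          floor: float = 1e-10) -> List[List[float]]:
    """One row of log power values per frame (time x frequency bins)."""
    return [[math.log(p + floor) for p in power_spectrum(f)] for f in frames]


N = 16
tone = [math.cos(2.0 * math.pi * 4 * n / N) for n in range(N)]  # 4 cycles per frame
spectrum = power_spectrum(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

The test tone completes four cycles per frame, so its energy lands in frequency bin 4; a 320-sample frame processed the same way would give 161 bins.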


8.2.3.5 Mel Filter Bank 


The features created from the STFT transformation of the audio aim to simulate 
conversions made by the human auditory system. The Mel filter bank is a 
set of bandpass filters that mimic the human auditory system. Rather than following a 
linear scale, these triangular filters are spaced linearly at lower frequencies and
logarithmically at higher frequencies, which is typical in speech signals. Figure 8.5
shows the Mel filter bank. The filter bank usually has 40 filters. 

The conversion between the Mel (m) and Hertz (f) domains can be accomplished 
by: 


$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right), \qquad f = 700\left(10^{m/2595} - 1\right) \tag{8.8}$$


Each of the filters produces an output that is the weighted sum of the spectral 
frequencies that correspond to each filter. These values map the input frequencies 
into the Mel scale. 
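
The Mel and Hertz conversions of Eq. (8.8) are easy to check numerically. The sketch below also computes filter center frequencies that are equally spaced on the Mel scale; it stops short of building the triangular filters themselves, and the function names and the 0 to 8000 Hz range are assumptions for the example.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(num_filters: int, f_min_hz: float, f_max_hz: float) -> List[float]:
    """Center frequencies (Hz) of filters equally spaced on the Mel scale."""
    m_min, m_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_max - m_min) / (num_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(num_filters)]


centers = mel_center_frequencies(40, 0.0, 8000.0)
```

Equal steps in Mel translate into narrow spacing between low-frequency centers and progressively wider spacing at high frequencies, which is the behavior described above.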


8.2.3.6 Discrete Cosine Transform 


The discrete cosine transform (DCT) maps the Mel scale features into a time-like
(cepstral) domain. The DCT function is similar to a Fourier transform but uses only real numbers 
(a Fourier transform produces complex numbers). It compresses the input data into 
a set of cosine coefficients that describe the oscillations in the function. The output 
of this conversion is referred to as the MFCC. 
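
A small sketch of this step is shown below. It assumes the unnormalized DCT-II, which is the variant commonly used for MFCCs but is not specified in the text, and it keeps only the first few coefficients; the input values are invented.

```python
import math
from typing import List, Sequence


def dct_ii(values: Sequence[float], num_coeffs: int) -> List[float]:
    """Unnormalized DCT-II: c_k = sum_n v_n * cos(pi * k * (n + 0.5) / M)."""
    M = len(values)
    return [sum(v * math.cos(math.pi * k * (n + 0.5) / M) for n, v in enumerate(values))
            for k in range(num_coeffs)]


# Invented log filter bank energies for one frame (4 filters for brevity).
flat_energies = [1.0, 1.0, 1.0, 1.0]
tilted_energies = [4.0, 3.0, 2.0, 1.0]

flat_coeffs = dct_ii(flat_energies, 3)
tilted_coeffs = dct_ii(tilted_energies, 3)
```

A flat set of filter bank energies is captured entirely by the zeroth coefficient, whereas a spectral tilt shows up in the first coefficient, which illustrates how a few cosine coefficients summarize the overall shape of the spectrum.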


Fig. 8.5: Mel filter bank shown with 16 filters. The filters are applied to the input 
signal to produce the Mel-scale output 


8.2.3.7 Delta Energy and Delta Spectrum 


The delta energy (delta) and delta spectrum (also known as “delta delta” or “double
delta”) features provide information about the slope of the transition between 
frames. The delta energy features are the difference between consecutive frames’ 
coefficients (the current and previous frames). The delta spectrum features are the 
difference between consecutive delta energy features (the current and previous delta 
energy features). The equations for computing the delta energy and delta spectrum 
features are: 


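$$d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^2} \tag{8.9}$$


$$dd_t = \frac{\sum_{n=1}^{N} n\,(d_{t+n} - d_{t-n})}{2\sum_{n=1}^{N} n^2} \tag{8.10}$$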
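
Equations (8.9) and (8.10) share the same form, so one function can compute both. In this sketch the window size N = 2 and the handling of edge frames (repeating the first and last frame) are assumptions; the input is an invented single-coefficient feature track.

```python
from typing import List, Sequence


def delta(features: Sequence[Sequence[float]], N: int = 2) -> List[List[float]]:
    """Regression-based deltas over a window of +/- N frames (Eq. 8.9).

    Frames beyond the sequence boundaries are replaced by the first or
    last frame, which is one common convention among several.
    """
    T = len(features)
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        row = []
        for k in range(len(features[t])):
            num = 0.0
            for n in range(1, N + 1):
                ahead = features[min(t + n, T - 1)][k]
                behind = features[max(t - n, 0)][k]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


ramp = [[float(t)] for t in range(6)]  # one coefficient rising by 1 per frame
d = delta(ramp)                        # delta features
dd = delta(d)                          # delta-delta: the same formula applied to d
```

Away from the edges, the delta of a coefficient that rises by one unit per frame is exactly 1, matching the interpretation of the delta as a slope.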


8.2.4 Other Feature Types 


Many acoustic features have been proposed over the years, applying different filters 
and transforms to highlight various aspects of the acoustic spectrum. Many of these 
approaches relied on hand-engineered features such as MFCCs, gammatone features 
[Sch+07], or perceptual linear predictive coefficients [Her90]; however, MFCCs 
remain the most popular. 

One of the downsides of MFCC features (or any manually engineered feature set) 
is the sensitivity to noise due to its dependence on the spectral form. Low dimen- 
sionality of the feature space was highly beneficial with earlier machine learning 
techniques, but with deep learning approaches, such as convolutional neural net- 
works, higher resolution features can be used or even learned. 

Overall, MFCC features are efficient to compute, apply useful filters for ASR, 
and decorrelate the features. They are sometimes combined with additional speaker- 
specific features (typically i-vectors) to improve the robustness of the model. 


8.2.4.1 Automatically Learned 


Various attempts have been made to learn the feature representations directly, rather 
than relying on engineered features, which may not be best for the overall task of
reducing WER. Some of the approaches include supervised learning of features with 
DNNs [Tüs+14], CNNs on raw speech for phone classification [PCD13], combined 
CNN-DNN features [HWW15], and unsupervised learning with RBMs [JH11]. 

Automatically learned features improve quality in specific scenarios but can also 
be limiting across domains. Features produced with supervised training learn to 
distinguish between the examples in the dataset and may be limited in unobserved 
environments. With the introduction of end-to-end models for ASR, these features 
are tuned during the end-to-end task alleviating the two-stage training process. 


8.3 Phones 


Following from NLP, the most logical linguistic representation for transforming 
speech into a transcript may seem to be words, ultimately because a word-level tran- 
script is the desired output and there is meaning attached at the word-level. Practi- 
cally speaking, however, speech datasets tend to have few transcribed examples per 
word, making word-level modeling difficult. A shared representation for words is 
desirable, to obtain sufficient training data for the variety of words that are possible. 
For example, phonemes can be used to phonetically discretize words in a particular 
language. Swapping one phoneme with another changes the meaning of the word 
(although this may not be the case for the same phonemes in another language). For 
example, if the third phone in the word sweet [swit] is changed from [i] to [e], the 
meaning of the whole word changes: sweat [swet]. 

Phonemes, themselves, tend to be too strict to use practically due to the attach- 
ment of meaning. Instead phones are used as a phonetic representation for the 
linguistic units (with potentially multiple phones mapping to a single phoneme). 
Phones do not map to any specific language, but rather, are absolute to speech 
itself, distinguishing sounds that signify speech. Figure 8.6 shows the phone set 
for English. 


AA  AY  EH  HH  L   OY  T   W 
AE  B   ER  IH  M   P   TH  Y 
AH  CH  EY  IY  N   R   UH  Z 
AO  D   F   JH  NG  S   UW  ZH 
AW  DH  G   K   OW  SH  V 


Fig. 8.6: English phone set, based on the ARPAbet symbols for ASR as used in the 
CMU Sphinx framework. The phone set is made up of 39 phones 


With phones, words are mapped to their phonetic counterpart by using a phonetic 
dictionary similar to the one shown in Fig. 8.7. A phonetic entry should be present 
for each word in the vocabulary (sometimes more than one entry if there are multiple 
ways to pronounce a word). By using phones to represent words, the shared repre- 
sentations can be learned from many examples across words, rather than modeling 
the full words. 


Word       Phone Representation 
a          AH 
aardvark   AA R D V AA R K 
aaron      EH R AH N 
aarti      AA R T IY 
...
zygote     Z AY G OW T 


Fig. 8.7: Phonetic dictionary for supported words in an ASR system. Note: the stress 
of the syllable is sometimes included, adding additional features to the phone 
representations 
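
A phonetic dictionary lookup can be sketched as a plain mapping from words to one or more phone sequences. The entries below follow the example dictionary of Fig. 8.7; the variable and function names are hypothetical, and always selecting the first listed pronunciation is a simplification.

```python
from typing import Dict, List, Sequence

# Each word maps to a list of pronunciations; each pronunciation is a phone list.
LEXICON: Dict[str, List[List[str]]] = {
    "a": [["AH"]],
    "aardvark": [["AA", "R", "D", "V", "AA", "R", "K"]],
    "aaron": [["EH", "R", "AH", "N"]],
    "aarti": [["AA", "R", "T", "IY"]],
    "zygote": [["Z", "AY", "G", "OW", "T"]],
}


def words_to_phones(words: Sequence[str],
                    lexicon: Dict[str, List[List[str]]] = LEXICON) -> List[str]:
    """Concatenate the first listed pronunciation of each word."""
    phones: List[str] = []
    for word in words:
        pronunciations = lexicon.get(word.lower())
        if pronunciations is None:
            raise KeyError(f"no phonetic entry for {word!r}")
        phones.extend(pronunciations[0])
    return phones


phones = words_to_phones(["a", "zygote"])
```

Words that share phones, such as the AA and R in "aardvark" and "aarti", contribute training examples to the same phone models, which is the benefit of the shared representation.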


If every word were pronounced with the same phones, then a mapping from the 
audio to the set of phones to words would be a relatively straightforward transformation.
However, audio exists as a continuous stream, and a speech signal does not 
necessarily have defined boundaries between the phone units or even words. The sig- 
nal can take many forms in the audio stream and still map to the same interpretable 
output. For example, the speaker’s pace, accent, cadence, and environment can all 
play significant roles in how to map the audio stream into an output sequence. The 
words spoken depend not only on the phone at any given moment, but also on the 
states that come before and after it. This natural dynamic in speech 
places a strong emphasis on the dependency of the surrounding context and phones. 

Combining phone states is a common strategy to improve quality, rather than 
relying on their canonical representations. Specifically, the transitions between 
words can be more informative than single phone states. In order to model this, 
diphones (parts of two consecutive phones), triphones, or senones 
(triphone context-dependent units) can be used as the linguistic representation or 
intermediary rather than phones themselves. Many methods exist for combining 
the phone representations with additional context, modeling them directly or by 
learning a statistical hierarchy of the state combinations, and most traditional
approaches rely on these techniques. 

Although ASR focuses on recognition rather than interpretation (e.g., the ac- 
curacy on recognizing spoken words rather than context-dependent word sequence 
modeling), the contextual understanding is an important aspect. In the case of ho- 
mophones, two words with the same phonetic representation and different spellings, 
predicting the correct word relies entirely on the surrounding context. In this case, 
some of the issues can be overcome with a language model, discussed later. Incorrect
phonetic substitutions further complicate matters. For example, in English, the 
representations of pin [P IH N] and pen [P EH N] are distinct. However, although 
these words have different phonetic representations, they are commonly
pronounced similarly or used interchangeably, requiring the correct selection 
to depend on the context more than on the phones themselves. With the inclusion 
of accents, phonetic representations can contain even more conflicts, requiring
alternative methods to determine speaker-specific features. These types of scenarios 
are crucial in ASR, for there are many times that humans may say the wrong word, 
and yet the context and intent can still be interpreted. All of these real-world factors 
of spoken language contribute to the complexity of automatic speech recognition in 
practice. 


8.4 Statistical Speech Recognition 


Statistical ASR focuses on predicting the most probable word sequence given a 
speech signal, via an audio file or input stream. Early approaches did not use a 
probabilistic focus, aiming to optimize the output word sequence by applying tem- 
plates for reserved words to the input acoustic features (this was historically used 
for recognizing spoken digits). Dynamic time warping (DTW) was an early way to 
expand this templating strategy by finding the “lowest constrained path” for the
templates. This approach allowed for variations in the input time sequence and output 
sequence; however, it was difficult to choose appropriate constraints, distance
metrics, and templates, and the approach lacked a statistical, probabilistic 
foundation. These drawbacks made the DTW-templating approach challenging to 
optimize. 
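
To illustrate template matching with dynamic time warping, the sketch below aligns an observed feature sequence against stored templates and selects the closest one. It uses scalar features, an absolute-difference local cost, and the three standard step directions; all of these are simplifying assumptions, and the template values are invented.

```python
from typing import Dict, Sequence


def dtw_distance(template: Sequence[float], observed: Sequence[float]) -> float:
    """Cost of the cheapest monotonic alignment between two sequences."""
    inf = float("inf")
    n, m = len(template), len(observed)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(template[i - 1] - observed[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # template advances
                                     cost[i][j - 1],      # observation advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


templates: Dict[str, Sequence[float]] = {
    "one": [1.0, 3.0, 1.0],
    "two": [2.0, 2.0, 4.0],
}
observed = [1.0, 1.0, 3.0, 3.0, 1.0]  # a time-stretched rendition of "one"
best = min(templates, key=lambda w: dtw_distance(templates[w], observed))
```

The stretched input aligns to the "one" template at zero cost even though the two sequences differ in length, which is the time elasticity that made DTW attractive; the choice of local cost and allowed steps is exactly the kind of constraint that was difficult to tune.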

A probabilistic approach was soon formed to map the acoustic signal to a word 
sequence. Statistical sequence recognition introduced a focus on maximum posterior 
probability estimation. Formally, this approach is a mapping from a sequence of 
acoustic speech features, X, to a sequence of words, W. The acoustic features are a 
sequence of feature vectors of length T: $X = \{x_t \in \mathbb{R}^D \mid t = 1, \dots, T\}$, and the word 
sequence is defined as $W = \{w_n \in V \mid n = 1, \dots, N\}$, having a length N, where V is the 
vocabulary. The most probable word sequence $W^*$ can be estimated by maximizing 
P(W|X) for all possible word sequences, $V^*$. Probabilistically this can be written as: 


$$W^* = \operatorname*{argmax}_{W \in V^*} P(W|X) \tag{8.11}$$


Solving this maximization is the central problem of ASR. Traditional approaches factorize this 
quantity, optimizing models to solve each component, whereas more recent end-to- 
end deep learning methods focus on optimizing for this quantity directly. 

Using Bayes’ theorem, statistical speech recognition is defined as: 


$$P(W|X) = \frac{P(X|W)P(W)}{P(X)} \tag{8.12}$$


The quantity P(W) represents the language model (the probability of a given 
word sequence) and P(X|W) represents the acoustic model. Because P(X) does not
depend on W, it does not affect which word sequence maximizes the expression,
and it can be removed: 


$$W^* = \operatorname*{argmax}_{W \in V^*} P(X|W)P(W) \tag{8.13}$$
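
The role of the two factors in Eq. (8.13) can be seen with a toy example. The candidate transcripts and all scores below are invented for illustration; a real decoder searches a vastly larger hypothesis space and works with log probabilities, as done here, to avoid numerical underflow.

```python
from typing import Dict, Iterable, Tuple

Words = Tuple[str, ...]

# Invented log P(X | W): the two candidates contain homophones, so the
# acoustic model cannot tell them apart.
acoustic_log_likelihood: Dict[Words, float] = {
    ("i", "see", "the", "sea"): -12.0,
    ("i", "sea", "the", "see"): -12.0,
}

# Invented log P(W): the language model prefers the grammatical ordering.
language_model_log_prob: Dict[Words, float] = {
    ("i", "see", "the", "sea"): -6.0,
    ("i", "sea", "the", "see"): -14.0,
}


def decode(candidates: Iterable[Words]) -> Words:
    """Return argmax over W of log P(X|W) + log P(W)."""
    return max(candidates,
               key=lambda w: acoustic_log_likelihood[w] + language_model_log_prob[w])


best = decode(acoustic_log_likelihood.keys())
```

Because the acoustic scores are tied, the language model term alone decides the result, which is how homophones are typically resolved.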


An overview of statistical ASR is illustrated in Fig. 8.8. 


Fig. 8.8: Diagram of statistical speech recognition 


Often, one of the most challenging components of speech recognition is the significant
difference between the number of steps in the input sequence compared to 
the output sequence (T ≫ N). For example, extracted acoustic features may represent
a 10 ms frame from the audio signal. A typical ten-word utterance could have 
a duration of 3 s, leading to an input sequence length of 300 and a target 
output sequence of 10 [You96]. Thus, a single word can span many frames and 
take a variety of forms, as shown in Fig. 8.9. It is, therefore, sometimes beneficial to 
split a word into sub-components that span fewer frames. 


Fig. 8.9: Spectrogram of a 16 kHz speech utterance, reciting the letters “D A V 
I D.” The spectrogram has been created with 20 ms frames with a 10 ms overlap, 
yielding a spectrogram size of 249 x 161. The output sequence has a length of 5 
corresponding to each of the characters in the vocabulary 


8.4.1 Acoustic Model: P(X|W) 


The statistical definition in Eq. (8.13) can be augmented to incorporate the mapping 
from acoustic features to phones and then from phones to words: 


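$$
\begin{aligned}
W^* &= \operatorname*{argmax}_{W} P(X|W)P(W) \\
    &= \operatorname*{argmax}_{W} \sum_{S} P(X, S|W)P(W) \\
    &\approx \operatorname*{argmax}_{W, S} P(X|S)P(S|W)P(W)
\end{aligned}
\tag{8.14}
$$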


where P(X|S) maps the acoustic features to phone states and P(S|W) maps phones 
to words (commonly referred to as the pronunciation model). 

Equation (8.13) showed two factors P(X|W) and P(W). Each of these factors 
is considered a model and therefore has learnable parameters, $\Theta_A$ and $\Theta_L$, for the 
acoustic model and language model, respectively. 


$$W^* = \operatorname*{argmax}_{W \in V^*} P(X|W, \Theta_A)\,P(W, \Theta_L) \tag{8.15}$$


This model now depends on predicting the likelihood of observations X, with the 
factor $P(X|W, \Theta_A)$. Solving this quantity requires a state-based modeling approach 
(HMMs). If a discrete-state model is assumed, the probability of an observation can 
be defined by introducing a state sequence S, where
$S = \{s_t \in \{s^{(1)}, \dots, s^{(Q)}\} \mid t = 1, \dots, T\}$, into P(X|W). 


$$P(X|W) = \sum_{S} P(X|S)P(S|W) \tag{8.16}$$


Equation (8.16) can be factorized further using the chain rule of probability to
produce the framewise likelihood. For notational convenience, let
$x_{1:n} = x_1, x_2, \dots, x_n$.


$$P(X|S) = \prod_{t=1}^{T} P(x_t \mid x_{1:t-1}, S) \tag{8.17}$$


Using the conditional independence assumption, this quantity can be reduced to: 


$$P(X|S) = \prod_{t=1}^{T} P(x_t \mid s_t) \tag{8.18}$$


The conditional independence assumption limits the context that is considered for 
prediction. We assume that any observation $x_t$ is only dependent on the current 
state, $s_t$, and not on the history of observations $x_{1:t-1}$, as shown in Fig. 8.10. This 
assumption reduces the computational complexity of the problem; however, it limits 
the contextual information included in any decision. The conditional independence 
assumption is often one of the biggest hurdles in ASR, due to the contextual nature 
of speech. Thus, a variety of techniques are centered around providing “context 
features” to improve quality. 
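
Under the conditional independence assumption, the likelihood of Eq. (8.18) is a product of per-frame terms, or a sum in log space. The sketch below scores two candidate state alignments for the same observations; the one-dimensional Gaussian emission models and all numeric values are invented for illustration.

```python
import math
from typing import Dict, Sequence, Tuple


def gaussian_log_pdf(x: float, mean: float, var: float) -> float:
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical emission model per phone state: (mean, variance).
EMISSIONS: Dict[str, Tuple[float, float]] = {
    "K": (0.0, 1.0),
    "AE": (3.0, 1.0),
    "T": (-2.0, 1.0),
}


def log_likelihood(observations: Sequence[float], states: Sequence[str],
                   emissions: Dict[str, Tuple[float, float]] = EMISSIONS) -> float:
    """log P(X|S) = sum over t of log P(x_t | s_t), following Eq. (8.18)."""
    if len(observations) != len(states):
        raise ValueError("one state is required per observation frame")
    return sum(gaussian_log_pdf(x, *emissions[s]) for x, s in zip(observations, states))


X = [0.1, -0.2, 2.9, 3.2, -1.8]
good = log_likelihood(X, ["K", "K", "AE", "AE", "T"])
bad = log_likelihood(X, ["T", "T", "K", "K", "AE"])
```

Each frame is scored using only its own state, so no information about neighboring frames enters the computation; this is the limitation that context features are meant to offset.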


[Figure: states s_1, s_2, ..., s_8, ... aligned with the observed features]


Fig. 8.10: State alignment with feature observations 


The conditional independence assumption allows us to compute the probability 
of an observation by summing over all possible state sequences S because the ac- 
tual state sequence that produced X is never known. The set of states Q can vary 
depending on the modeling approach of the ASR system. In a simple system, the 
target states are sub-word units (such as English phones). 

The transition alignments between frames are not known beforehand. We use an 
HMM, allowing us to learn the temporal dilation, and train it using the expectation 
maximization (EM) algorithm. In general, the EM algorithm estimates the state oc- 
cupation probabilities with the current HMM parameters and then re-estimates the 
HMM parameters based on the estimation. 

An HMM is composed of two stochastic processes: a hidden part that is a Markov 
chain and an observable process that is probabilistically dependent on the Markov 
chain. The aim is to model probability distributions of the states that produce the 
observable events, which are acoustic features. Formally, the HMM is defined by: 


1. A set of Q states $S = \{s^{(1)}, \dots, s^{(Q)}\}$. The Markov chain can only be in one state at 
a time. In a simple ASR model, the state set S could be the set of phones for the 
language. 

2. The initial state probability distribution, $\pi = \{P(s^{(i)} \mid t = 0)\}$, where t is the time 
index. 

3. A probability distribution that defines the transitions between states: 
$a_{ij} = P(s_t^{(j)} \mid s_{t-1}^{(i)})$. The transition probabilities $a_{ij}$ are independent of time t. 

4. Observations X from our feature space F. In our case, this feature space can be 
all continuous acoustic features that are input into our model. These features are 
given to us by the acoustic signal. 

5. A set of probability distributions, emission probabilities (sometimes referred to 
as output probabilities). This set of distributions describes the properties of the 
observations yielded by each state, i.e., $b = \{b_i(x) = P(x \mid s^{(i)})\}$.


• Emission distributions: $b_i(x) = P(x \mid s^{(i)})$
• Transition probabilities: $a_{ij} = P(s_t \mid s_{t-1})$
• Initial state probabilities: $\pi = P(s_1)$. 


The transition into state $s_t$ depends only on the previous state $s_{t-1}$. The lexicon model (discussed in the next section) provides the initial transition probabilities. These transitions can be self-loops, allowing the time dilation that is necessary for elasticity in the frame-based prediction.
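As a rough illustration of self-loops, the sketch below builds a left-to-right transition matrix for the three phones of "cat". The self-loop probability of 0.6 and the function name are assumptions made for this example, not values from the text.

```python
import numpy as np

def left_to_right_transitions(num_states, self_loop=0.6):
    """Build a left-to-right transition matrix with self-loops.

    a[i, j] = P(s_t = j | s_{t-1} = i). Each state either repeats
    (self-loop) or advances to the next state; the last state only loops.
    """
    a = np.zeros((num_states, num_states))
    for i in range(num_states - 1):
        a[i, i] = self_loop
        a[i, i + 1] = 1.0 - self_loop
    a[-1, -1] = 1.0
    return a

phones = ["k", "ae", "t"]        # the 3-phone word "cat"
pi = np.array([1.0, 0.0, 0.0])   # always start in the first phone
a = left_to_right_transitions(len(phones))
```

A larger self-loop probability makes it more likely that a phone is stretched over many frames, which is the time dilation described above.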

The HMM is optimized by learning $\pi$, $a$, and $b(x)$ from the acoustic observations $X$ and the target sequence of phone states $Y$. An initial estimate for $P(s_j \mid s_i)$ can be obtained from the lexicon model, $P(S \mid W)$. The forward-recursion algorithm is used to score the current model parameters $a$, $b(x)$, yielding the quantities needed to obtain $P(X \mid S)$. The Viterbi algorithm is used to avoid computing the sum over all paths and serves as an approximation of the forward algorithm (here $\lambda$ denotes the HMM parameters):

\[
\begin{aligned}
P(X \mid S) &= \sum_{\{\text{path}_l\}} P(X, S \mid \lambda) && \text{(Baum-Welch)} \\
&\approx \max_{\text{path}_l} P(X, S \mid \lambda) && \text{(Viterbi)}
\end{aligned}
\tag{8.19}
\]


Training is typically accomplished by the forward-backward (or Baum-Welch) and Viterbi algorithms [Rab89b]. In our case, the emission probabilities target maximizing the probability of the sample given the model. Because of this, the Viterbi algorithm focuses only on the most likely path in the set of possible state sequences (Fig. 8.11). The emission probability density function is usually modeled with a Gaussian or a mixture of Gaussians.
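The difference between summing over all paths and keeping only the best one can be sketched on a toy model. Everything below is illustrative: the transition matrix, the random stand-in emission likelihoods `b[t, j]`, and the function names are assumptions, and probabilities are multiplied directly rather than in the log domain that a real system would use.

```python
import numpy as np

def forward_likelihood(pi, a, b):
    """Sum of P(X, S | lambda) over all state paths S; b[t, j] = P(x_t | s_j)."""
    alpha = pi * b[0]
    for t in range(1, len(b)):
        alpha = (alpha @ a) * b[t]
    return alpha.sum()

def viterbi(pi, a, b):
    """Best single path and its joint probability max_S P(X, S | lambda)."""
    T, Q = b.shape
    delta = pi * b[0]
    backptr = np.zeros((T, Q), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] * a      # scores[i, j]: leave state i, enter state j
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):        # backtrack from the best final state
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], delta.max()

rng = np.random.default_rng(0)
pi = np.array([1.0, 0.0, 0.0])
a = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
b = rng.uniform(0.1, 1.0, size=(8, 3))   # 8 frames, 3 phone states
total = forward_likelihood(pi, a, b)
best_path, best_score = viterbi(pi, a, b)
```

Because the best path is one of the terms in the sum, `best_score` can never exceed `total`; the gap between them is what the Viterbi approximation discards.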
Fig. 8.11: All possible state transitions to produce the 3-phone word “cat” for an 8-frame utterance. The Viterbi path applied to the possible state transitions is shown in red


8.4.1.1 Lexicon Model: P(S|W)


A model for $P(S \mid W)$ can be constructed by representing the probability of a state sequence given a word sequence. This model is commonly referred to as the pronunciation or lexicon model. We factorize this using the probabilistic chain rule to obtain:

\[ P(S \mid W) = \prod_{t=1}^{T} P(s_t \mid s_{1:t-1}, W) \tag{8.20} \]

Once again, using the conditional independence assumption, this quantity is approximated by:

\[ P(s_t \mid s_{1:t-1}, W) = P(s_t \mid s_{t-1}, W) \tag{8.21} \]


The conditional independence assumption introduced here is also the first-order Markov assumption, allowing us to implement the model as a first-order HMM. The states of the model, $s_t$, are not directly observable; therefore, we are not able to observe the transition from one phonetic unit to another. However, the observations $x_t$ do depend on the current state $s_t$. The HMM allows us to infer information about the state sequence from the observations.

First, the word vocabulary V is converted into the state representations for each 
term to create a word model. 

The lexicon model can be used to determine the initial probability of each state $P(s_1)$ by counting the occurrence rate for the beginning of each word. The transition probabilities accumulate over the lexical version of the transcript targets for the acoustic data.
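The counting described above can be sketched as follows. The toy lexicon and transcript are hypothetical and exist only to make the arithmetic visible; a real system would read them from the pronunciation dictionary and the training transcripts.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon (word -> phone sequence) and transcript.
lexicon = {"cat": ["k", "ae", "t"], "at": ["ae", "t"], "tack": ["t", "ae", "k"]}
transcript = ["cat", "at", "tack", "cat"]

initial_counts = Counter()
transition_counts = defaultdict(Counter)
for word in transcript:
    phones = lexicon[word]
    initial_counts[phones[0]] += 1               # phone that begins the word
    for prev, nxt in zip(phones, phones[1:]):    # within-word phone transitions
        transition_counts[prev][nxt] += 1

total = sum(initial_counts.values())
initial_prob = {p: c / total for p, c in initial_counts.items()}
transition_prob = {
    prev: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
    for prev, nxts in transition_counts.items()
}
```

In this toy corpus "k" begins two of the four words, so its initial probability is 0.5, and "ae" is followed by "t" three times out of four.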

A state-based word-sequence model can be created for each word in the vocabu- 
lary, as shown in Fig. 8.12. 
Fig. 8.12: Phone state model for the 3-phone word “cat” with transition probabilities


8.4.2 Language Model: P(W)


The language model $P(W)$ is typically an n-gram language model leveraging the probabilistic chain rule. It is factorized by using the conditional independence assumption, except with an $(m-1)$-th order Markov assumption, where $m$ is the number of grams to be considered. It can be described as:

\[
\begin{aligned}
P(W) &= \prod_{n=1}^{N} P(w_n \mid w_1 \ldots w_{n-1}) \\
&\approx \prod_{n=1}^{N} P(w_n \mid w_{n-m+1:n-1})
\end{aligned}
\tag{8.22}
\]

HMMs are robust models for training and decoding sequences. When training the HMM models, we focus on training individual models for the state alignments and then combine them into a single HMM for continuous speech recognition. HMMs also allow the word sequence to be incorporated, creating a state sequence based on words and applying the word sequence priors as well. Furthermore, HMMs support compositionality; therefore, the time dilation, pronunciation, and word sequences (grammar) are handled in the same model by composing the individual components:


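\[
\begin{aligned}
P(q \mid M) &= P(q, \phi, w \mid M) \\
&\approx P(q \mid \phi) \cdot P(\phi \mid w) \cdot P(w_n \mid w_1 \ldots w_{n-1})
\end{aligned}
\tag{8.23}
\]

Unfortunately, many of the assumptions that are needed to optimize HMMs limit their functionality [BM12]: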


• HMM and DNN models are trained independently of each other and yet depend on each other.

• A priori choices of statistical distributions rely on linguistic information from handcrafted pronunciation dictionaries. These are subject to human error.

• The first-order Markov assumption, often referred to as the conditional independence assumption (states depend only on their previous state), forces strict limitations on the number of context states considered for an individual prediction.

• The decoding process is complex.


8.4.3 HMM Decoding 


The decoding process for an HMM-based ASR model finds the optimal word sequence by combining the various models. The process first decodes a state sequence from the acoustic features and then decodes the optimal word sequence from the state sequence. Phonetic decoding has traditionally relied on interpreting the HMM's probability lattice constructed for each word from the phonetic lexicons according to the acoustic features. Decoding can be done using the Viterbi algorithm on the HMM output lattice. Viterbi decoding performs an exact search, which is efficient for small models but too expensive for large-vocabulary tasks. Beam search is often used instead to reduce the computation. The decoding process uses backtracking to keep track of the word sequence produced.





Fig. 8.13: (a) HMM state representation, (b) phone state transitions for the word “data,” and (c) grammar state model. This figure is adapted from [MPR08]
During prediction, decoding the HMM typically relies on weighted automata and transducers. In the simple case of weighted finite state acceptors (WFSA), the automata are composed of a set of states (initial, intermediate, and final), a set of transitions between states, each with a label and weight, and final weights for each final state. The weights express the probability, or cost, of each transition. HMMs can be expressed in the form of finite state automata, with transitions connecting the states. WFSA accept or reject possible decoding paths depending on the states and the possible transitions. The topology could represent a word, the possible word pronunciation(s), or the probabilities of the states in the path that result in this word (Fig. 8.13). Decoding, therefore, depends on combining the state models from the HMM with the pronunciation dictionary and n-gram language models.

Usually, weighted finite state transducers (WFST) are used to represent the different levels of state transition in the decoding phase [MPR08]. WFSTs transduce an input sequence to an output sequence. WFSTs add an output label, which can be used to tie different levels of the decoding relationships together, such as phones and words. A WFSA is a WFST without the output label. The WFST representation allows models to be combined and optimized jointly via its structural properties with efficient algorithms: composition, determinization, and minimization. The composition property allows different types of WFSTs to be constructed independently and composed together, such as combining a lexicon (phones to words) WFST and a probabilistic grammar. Determinization ensures a unique initial state and that no two transitions leaving a state share the same input label. Minimization combines redundant states and can be thought of as suffix sharing. Thus, the whole decoding algorithm for a DNN-HMM hybrid model can be represented by WFSTs via four transducers:


HMM: mapping HMM states to CD phones 
Context-dependency: mapping CD phones to phones 
Pronunciation lexicon: mapping phones to words 
Word-level grammar: mapping words to words. 


In Kaldi, for example, these transducers are referred to as H, C, L, and G, respectively. Compositionality allows L and G to be composed into a single transducer, $L \circ G$, that maps phone sequences to a word sequence. Practically, the composition of these transducers may grow too large, so the conversion usually takes the form HCLG, where

\[ HCLG = \min(\det(H \circ \min(\det(C \circ \min(\det(L \circ G)))))). \tag{8.24} \]
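To make the composition operator concrete, here is a deliberately simplified sketch. It is a toy, not how Kaldi or OpenFst implement composition: the dictionary-based transducer format, the function names, and the example symbols are all assumptions, and epsilon transitions (which a real lexicon transducer needs) are not handled.

```python
def compose(t1, t2):
    """Epsilon-free composition of two toy transducers.

    Each transducer is a dict with 'start', 'finals' (a set), and 'arcs',
    a list of (src, input, output, weight, dst). Weights are costs
    (for example negative log probabilities), so they add along a path.
    """
    arcs = []
    for (s1, i1, o1, w1, d1) in t1["arcs"]:
        for (s2, i2, o2, w2, d2) in t2["arcs"]:
            if o1 == i2:   # output of the first must match input of the second
                arcs.append(((s1, s2), i1, o2, w1 + w2, (d1, d2)))
    return {
        "start": (t1["start"], t2["start"]),
        "finals": {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]},
        "arcs": arcs,
    }

def best_output(t, inputs):
    """Lowest-cost (cost, outputs) for `inputs`, or None if the input is rejected."""
    frontier = {t["start"]: (0.0, [])}
    for symbol in inputs:
        nxt = {}
        for (src, i, o, w, dst) in t["arcs"]:
            if src in frontier and i == symbol:
                cost, out = frontier[src]
                if dst not in nxt or cost + w < nxt[dst][0]:
                    nxt[dst] = (cost + w, out + [o])
        frontier = nxt
    finals = [v for s, v in frontier.items() if s in t["finals"]]
    return min(finals, key=lambda v: v[0]) if finals else None

t1 = {"start": 0, "finals": {0}, "arcs": [(0, "a", "x", 1.0, 0), (0, "b", "y", 0.5, 0)]}
t2 = {"start": 0, "finals": {0}, "arcs": [(0, "x", "X", 0.25, 0), (0, "y", "Y", 0.25, 0)]}
composed = compose(t1, t2)
result = best_output(composed, ["a", "b"])
rejected = best_output(composed, ["c"])
```

The composed machine maps the input of the first transducer directly to the output of the second, with the costs of both added. That is the role $L \circ G$ plays when phone sequences are mapped to word sequences.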


8.5 Error Metrics


The most commonly used metric for speech recognition is word error rate (WER). 
WER measures the edit distance between the prediction and the target by consid- 
ering the number of insertions, deletions, and substitutions, using the Levenshtein 
distance measure. 
Word error rate is defined as:

\[ \text{WER} = 100 \times \frac{I + D + S}{N} \tag{8.25} \]

where

I is the number of word insertions,
D is the number of word deletions,
S is the number of word substitutions, and
N is the total number of words in the target.


For character-based models and character-based languages, error metrics focus 
on CER (character error rate), sometimes referred to as LER (letter error rate). 
Character-based models will be explored more in Chap. 9: 


\[ \text{CER} = 100 \times \frac{I + D + S}{N} \tag{8.26} \]

where

I is the number of character insertions,
D is the number of character deletions,
S is the number of character substitutions, and
N is the total number of characters in the target.


CER and WER are used to identify how closely a prediction resembles its target, giving a measurement of the overall system. They are straightforward to compute and give a simple summary of the recognition system's quality. Figure 8.14 shows the scripts to compute WER and CER. A few examples of WER and CER are shown in Figs. 8.15, 8.16, 8.17.
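A minimal sketch of how Eqs. (8.25) and (8.26) can be computed is given below. It is not the original script from the figure; the function names are assumptions, and the character-level variant counts spaces as characters, which is one of several possible conventions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def wer(target, prediction):
    """Word error rate: edit distance over words, relative to the target length."""
    ref = target.split()
    return 100.0 * edit_distance(ref, prediction.split()) / len(ref)

def cer(target, prediction):
    """Character error rate: edit distance over characters (spaces included)."""
    return 100.0 * edit_distance(target, prediction) / len(target)

exact = wer("the cat sat", "the cat sat")
one_char_wer = wer("the cat sat", "the bat sat")
one_char_cer = cer("the cat sat", "the bat sat")
many_insertions = wer("hello", "hello there big world")
```

A single wrong character costs a third of the words (WER of about 33.3) but only one of eleven characters (CER of about 9.1), and three inserted words against a one-word target give a WER of 300.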

One of the drawbacks to edit distance metrics, however, is that they do not give any indication of what the errors might be. Measuring specific types of errors therefore requires additional investigation, such as computing SWER (salient word error rate) or looking at concept accuracy. In [MMG04], the authors suggested improvements to the WER metric in the form of MER (match error rate) and WIL (word information lost). These metrics can be useful when the information communicated is more important than the edit cost, with the added benefit of providing probabilistic interpretations (as WER can be greater than 100).
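For reference, MER and WIL are commonly computed from the alignment counts of hits (H), substitutions (S), deletions (D), and insertions (I). The sketch below follows those commonly used definitions; the function names are our own, and the counts are assumed to come from a word alignment computed elsewhere.

```python
def mer(hits, subs, dels, ins):
    """Match error rate: share of aligned word pairs that are errors."""
    return (subs + dels + ins) / (hits + subs + dels + ins)

def wil(hits, subs, dels, ins):
    """Word information lost: 1 - H^2 / (N_ref * N_hyp)."""
    n_ref = hits + subs + dels   # number of words in the target
    n_hyp = hits + subs + ins    # number of words in the prediction
    return 1.0 - (hits * hits) / (n_ref * n_hyp)

# Target "hello" against prediction "hello there big world": 1 hit, 3 insertions.
example_mer = mer(hits=1, subs=0, dels=0, ins=3)
example_wil = wil(hits=1, subs=0, dels=0, ins=3)
```

For this example the WER is 300, while both MER and WIL are 0.75, staying within the interval from 0 to 1.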


8.6 DNN/HMM Hybrid Model 


GMMs were a popular choice because they are capable of modeling $P(x_t \mid s_t)$ directly. Additionally, they provide a probabilistic interpretation of the input, modeling the distribution under each state. However, the Gaussian distribution at each state is a strong assumption itself. In practice, the features may be strongly non-Gaussian. DNNs showed significant improvements over GMMs with their ability to learn non-linear functions. The DNN cannot provide the conditional likelihood directly. The framewise posterior distribution is used to turn the probabilistic model of $P(x_t \mid s_t)$ into a classification problem $P(s_t \mid x_t)$ using the pseudo-likelihood trick as an approximation of the joint probability. The application of the pseudo-likelihood is referred to as the “hybrid approach.”

Fig. 8.14: Python functions to compute WER and CER

Fig. 8.15: An exact match between the prediction and the target yields a WER and CER of 0

Fig. 8.16: Changing one character of the predicted word yields a larger increase in WER because the entire word is wrong, albeit phonetically similar. The change in CER is much smaller by comparison because there are more characters than words; thus a single character change has less effect

Fig. 8.17: WER and CER are typically not treated as percentages, because they can exceed 100%. The loss or insertion of large sections can greatly increase the WER and CER
In the pseudo-likelihood $P(x_t \mid s_t) \propto P(s_t \mid x_t) / P(s_t)$, the numerator is a DNN classifier, trained with a set of input features as the input $x_t$ and target state $s_t$. In a simple case, if we consider 1 state per phone, then the number of classifier categories will be len(q). The denominator $P(s_t)$ is the prior probability of the state $s_t$. Note that training the framewise model requires framewise alignments with $x_t$ as the input and $s_t$ as the target, as shown in Fig. 8.18. This alignment is usually created by leveraging a weaker HMM/GMM alignment system or through human-created labels. The quality and quantity of the alignment labels are typically the most significant limitations of the hybrid approach.

Fig. 8.18: In order to apply the DNN classifier, a framewise target must exist. A constrained alignment is computed using an existing classifier to align the acoustic features and the known sequence of states
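The conversion from DNN posteriors to scaled likelihoods can be sketched in a few lines. The alignment labels and the posterior values below are made-up stand-ins for real alignments and real network outputs, and the function name is an assumption.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors):
    """log P(x_t | s) up to a constant: log P(s | x_t) - log P(s)."""
    return log_posteriors - np.log(state_priors)

# Priors estimated from (hypothetical) framewise alignment labels.
alignment = np.array([0, 0, 0, 1, 1, 2, 2, 2])
num_states = 3
priors = np.bincount(alignment, minlength=num_states) / len(alignment)

# Stand-in for DNN softmax outputs: one row of state posteriors per frame.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.6, 0.3]])
scaled = scaled_log_likelihoods(np.log(posteriors), priors)
```

Dividing by the prior removes the bias toward states that are simply frequent in the alignments, so the result can take the place of the emission score during HMM decoding.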


Classifier construction requires the selection of target states (words, phones, triphone states). Selection of the states, Q, can make a significant difference in the quality and complexity of the task. First, it must support the recognition task to obtain the alignments. Second, it must be practical for the classification task. For example, a classifier may be easier to train for phones; however, framewise labels for the training data and a decoding scheme may be much more difficult to obtain. Alternatively, word-based states are straightforward to create, but it is harder to get framewise alignments and to train a classifier for them.


8.7 Case Study 


In this case study, the main focus is on training ASR systems using open source 
frameworks. We begin by training a traditional ASR engine and then move towards 
the more advanced models in the frameworks, ending with a TDNN model. 
8.7.1 Dataset: Common Voice


In this case study we focus on building ASR models for the Common Voice¹ dataset released by Mozilla. Common Voice is a 500 h corpus of speech recorded from text. It is composed of crowdsourced speakers recording one utterance per example. These recordings are then peer-reviewed to assess the quality of the transcript-recording pair. Depending on the number of positive and negative votes that each utterance receives, it is labeled as either valid, invalid, or other. The valid category contains samples that have had at least two reviews, with the majority confirming that the audio matches the text. The invalid category similarly has had at least two reviews, with the majority confirming that the audio does not match the text. The other category contains all the files with fewer than two votes or with no majority consensus. Each of the sub-groups, valid and other, is further split into train, test, and dev (validation). The “cv-valid-train” dataset contains a total of 239.81 h of audio. Overall, the dataset is complex, containing a variety of accents, recording environments, ages, and genders.


8.7.2 Software Tools and Libraries 


Kaldi is one of the most widely used toolkits for ASR, developed mainly for re- 
searchers and professional use. It is developed primarily by Johns Hopkins Univer- 
sity and built entirely in C++ with shell scripts tying various components of the 
library together. The design focuses on providing a flexible toolkit that can be mod- 
ified and extended to the task. Collections of scripts referred to as “recipes” are used 
to connect the components to perform training and inference. 

CMU Sphinx is an ASR toolkit that was developed at Carnegie Mellon University. It also relies on HMM-based speech recognition and n-gram language models for ASR. There have been various releases of the Sphinx toolkit, with Sphinx 4 being the most current. A version of Sphinx called PocketSphinx is more suitable for embedded systems (typically it comes with some losses in quality due to the restrictions of the hardware).


8.7.3 Sphinx 


In this section, we train a Sphinx ASR model on the Common Voice dataset. This 
framework relies on a variety of packages, mainly based on C++. In light of this, 
much of the work in this ASR introduction chapter focuses on the scripts and asso- 
ciated concepts, namely the data preparation. Like many frameworks, once the data 
is appropriately formatted, the framework is relatively straight-forward. 


¹ https://voice.mozilla.org/en/data.




8.7.3.1 Data Preparation 


The data preparation required for Sphinx is the most important step. Sphinx is con- 
figured to look in specific places for certain things and expects consistency between 
files and file names. The conventional structure is a top level directory with the same 
name as the dataset. This name is used as the file name for the subsequent files. In- 
side this directory, there are two directories “wav” and “etc.” The “wav” directory 
contains all training and testing audio files in wav form. The “etc” directory contains 
all configuration and transcript files. Figure 8.19 shows the file structure. 


    /common_voice
        etc/
            common_voice.dic
            common_voice.filler
            common_voice.idngram
            common_voice.lm
            common_voice.lm.bin
            common_voice.phone
            common_voice.vocab
            common_voice_test.fileids
            common_voice_test.transcription
            common_voice_train.fileids
            common_voice_train.transcription
        wav/
            train_sample000000.wav
            test_sample000000.wav
            train_sample000001.wav
            test_sample000001.wav


Fig. 8.19: Files that are created for Sphinx 


Common Voice initially comes packaged with “mp3” files. These will be converted to “wav” by using the “SoX” tool.² The processing script is shown in Fig. 8.20.

After we create the “wav” files, we create a list of all files that should be used for training, and separately, validation (referred to as testing for the Sphinx framework). These lists are stored in the “.fileids” files, which contain a single file name per line without the file extension. This is illustrated in Fig. 8.21.

Next, we create the transcript files. The transcript files have one utterance transcript per line with the “fileid” specified at the end. A sample of a transcript file is shown in Fig. 8.22.


² http://sox.sourceforge.net/.
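    import os
    import subprocess
    from multiprocessing.pool import ThreadPool

    # args (parsed command-line options) and train_wav_files (a list of
    # (mp3_path, wav_path) pairs) are defined elsewhere in the script.
    def convert_to_wav(x):
        file_path, wav_path = x
        file_name = os.path.splitext(os.path.basename(file_path))[0]
        # Resample to the target rate, 16-bit, single channel
        cmd = "sox {} -r {} -b 16 -c 1 {}".format(
            file_path,
            args.sample_rate,
            wav_path)
        subprocess.call(cmd, shell=True)

    # Run the conversions in parallel across 10 threads
    with ThreadPool(10) as pool:
        pool.map(convert_to_wav, train_wav_files)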

Fig. 8.20: Converts mp3 files to wav files, using the sox library. The function is 
parallelized to increase the speed of the conversion 


    train_sample000000
    train_sample000001
    train_sample000002


Fig. 8.21: Sample from the “common_voice_train.fileids” file


    <s> learn to recognize omens and follow them the old king had said </s> (train_sample000000)
    <s> everything in the universe evolved he said </s> (train_sample000001)
    <s> you came so that you could learn about your dreams said the old woman </s> (train_sample000002)


Fig. 8.22: Sample from the “common_voice_train.transcript” file 
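The two line formats described above can be generated together. The sketch below is illustrative only: the helper name and the lower-casing step are assumptions, and writing the lines to the “.fileids” and “.transcription” files is left out.

```python
import os

def sphinx_entry(wav_name, text):
    """Return the (.fileids line, transcript line) pair for one utterance."""
    file_id = os.path.splitext(os.path.basename(wav_name))[0]   # drop directory and extension
    transcript_line = "<s> {} </s> ({})".format(text.lower().strip(), file_id)
    return file_id, transcript_line

samples = [
    ("wav/train_sample000000.wav", "Learn to recognize omens"),
    ("wav/train_sample000001.wav", "Everything in the universe evolved"),
]
fileids, transcripts = zip(*(sphinx_entry(w, t) for w, t in samples))
```

Generating both lines from the same file ID keeps the two files consistent, which matters because Sphinx expects matching entries in the same order.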


Once the transcript files are created, we turn our attention to the phonetic units used. In this example, the same phones are used as illustrated in Fig. 8.6, with one additional phone <SIL> to symbolize the silence token.

The next step is to create the phonetic dictionary. We create a list of all words in 
the training dataset transcripts. The script in Fig. 8.23 shows a simple way of doing 
this. 

    import collections
    import csv
    import os

    # args and etc_dir are defined elsewhere in the script
    counter = collections.Counter()
    with open(args.csv_file) as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            trans = row['text']
            counter += collections.Counter(trans.split())

    # Write each distinct word once, in lower case
    with open(os.path.join(etc_dir, 'common_voice.words'), 'w') as f:
        for item in counter:
            f.write(item.lower() + '\n')


Fig. 8.23: This script creates a file, “common_voice.words”, that contains one word per line from the training data. Note: each word is only represented once in this file


Next, we create a phonetic dictionary (lexicon) using the word and phone lists. Creating the lexicon model typically requires linguistic expertise or existing models to create mappings for these words, as the phonetic representation should match the pronunciation. To ease this dependency, CMU Lextool³ is used to create our phonetic dictionary, which is saved as “common_voice.dic”. Note: some extra processing is required here to ensure that no phones are added to the representation beyond those specified in our “.phone” file. Additionally, in this example, all phones and transcripts are represented in lower case, so the phonetic dictionary needs to match as well. A sample from the phonetic dictionary is shown in Fig. 8.24.


    a ah
    a(2) ey
    monk m ah ng k
    dressed d r eh s t
    in ih n
    black b l ae k
    came k ey m
    to t uw


Fig. 8.24: Sample from the “common_voice_train.dic” file
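The extra processing mentioned above, checking that the dictionary uses only phones from the “.phone” file, might look like the sketch below. The function name, the in-memory inputs, and the abbreviated phone set are assumptions for illustration.

```python
def unknown_phones(lexicon_lines, phone_set):
    """Map each word to the phones in its entry that are not in phone_set."""
    problems = {}
    for line in lexicon_lines:
        parts = line.split()
        if not parts:
            continue                      # skip blank lines
        word, phones = parts[0], parts[1:]
        extra = [p for p in phones if p not in phone_set]
        if extra:
            problems[word] = extra
    return problems

phone_set = {"ah", "ey", "m", "ng", "k", "t", "uw", "sil"}   # abbreviated for the example
lexicon_lines = ["monk m ah ng k", "to t uw", "came K EY M"]
problems = unknown_phones(lexicon_lines, phone_set)
```

The upper-case entry for "came" is reported because the comparison is case-sensitive, mirroring the requirement that phones, transcripts, and dictionary all use the same case.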


Our final step is to create a language model. Most language models follow the ARPA format, listing each n-gram and its associated probabilities on its own line, with section delimiters marking each increase in the number of grams. We create a 3-gram language model using CMUCLMTK, a language modeling toolkit from CMU. The script counts the different n-grams and computes the probabilities of each. The script is shown in Fig. 8.25 and a sample of the language model is shown in Fig. 8.26.


³ http://www.speech.cs.cmu.edu/tools/lextool.html.


    # Create vocab file
    text2wfreq < etc/common_voice_train.transcription | wfreq2vocab > etc/common_voice.vocab

    # Create n-gram count from training transcript file
    text2idngram -vocab etc/common_voice.vocab -idngram etc/common_voice.idngram < etc/common_voice_train.transcription

    # Create language model from n-grams
    idngram2lm -vocab_type 0 -idngram etc/common_voice.idngram -vocab etc/common_voice.vocab -arpa etc/common_voice.lm

    # Convert language model to binary (compression)
    sphinx_lm_convert -i etc/common_voice.lm -o etc/common_voice.lm.DMP


Fig. 8.25: CMUCLMTK creating the language modeling file 


With the preprocessing complete, we are ready to train the ASR models. 


8.7.3.2 Model Training 


The model training process for Sphinx is straight-forward, with everything set up to 
follow the configuration file. To generate the configuration file, we run: 


    sphinxtrain -t common_voice setup


With the setup config, the Sphinx model can be trained by running: 


    sphinxtrain run


The training function runs a series of scripts that check the configuration and 
setup. It then performs a series of transforms on the data to produce the features, 
and then trains a series of models. The Sphinx framework achieves a WER of 39.824 
and CER of 24.828 on the Common Voice validation set. 


8.7.4 Kaldi 


In this section, we train a series of Kaldi models to build a high-quality ASR model on the Common Voice dataset. Training a high-quality model requires intermediate models to align the acoustic feature frames to the phonetic states. The code and explanations are adapted from the Kaldi tutorial.
    \data\
    ngram 1=8005
    ngram 2=…
    ngram 3=49969

    \1-grams:
    … </s> 0.0000
    -0.9757 <s> -4.8721
    -1.6598 a -4.5631
    … aaron -1.2761
    -4.5116 abandon -1.7707
    -3.9910 abandoned …

    \2-grams:
    …
    -2.8197 <s> about -2.4474
    -4.0634 <s> abraham 0.0483
    -2.9228 <s> absolutely -1.8134

    \3-grams:
    -0.9673 <s> a boy
    …
    -2.6800 <s> a bunch
    -1.5866 <s> a card
    -2.2998 <s> a citation


Fig. 8.26: Sample from the “common_voice.lm” file in the standard ARPA format
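To show how the sections of an ARPA file fit together, the sketch below parses a tiny ARPA-style sample into a dictionary. The parser is a simplified illustration rather than a complete reader, the function name is an assumption, and the n-gram counts in the sample header are made up; the three entries reuse values from the figure.

```python
def parse_arpa(lines):
    """Return {order: {ngram_tuple: (log10_prob, backoff_or_None)}}."""
    ngrams, order = {}, None
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue                                   # header and blank lines
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])       # e.g. "\2-grams:" gives 2
            ngrams[order] = {}
            continue
        fields = line.split()
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional (absent for the highest order)
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (float(fields[0]), backoff)
    return ngrams

sample = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.9757 <s> -4.8721
-1.6598 a -4.5631

\\2-grams:
-2.8197 <s> about -2.4474

\\end\\
""".splitlines()
model = parse_arpa(sample)
```

Each entry stores a base-10 log probability followed by the n-gram and, for lower orders, a backoff weight that is applied when a longer n-gram has not been observed.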


8.7.4.1 Data Preparation 


The data preparation in Kaldi is similar to the Sphinx preparation, requiring transcription and audio ID files, shown in Fig. 8.27. Kaldi has a set of scripts in place to automate the construction of these files, reducing the manual work required in the Sphinx setup.

First, we prepare a mapping from each utterance to the path of its “.wav” file. We create an utterance ID for each of the files. This utterance ID is used to tie the file to the different representations in the training pipeline. In Kaldi, this mapping is a simple text file with the “.scp” extension.

Second, a file mapping the utterance ID to the utterance transcript is created (Fig. 8.28).⁴ This file will be used to create the utterance.

A corpus file contains all the utterances from the dataset. It is used to compute the word-level decoding graphs for the system.


⁴ If there are additional labels like speaker and gender, these can also be used in the process. Common Voice does not have these labels, so each utterance is treated independently.
    spk2gender  [<speaker-id> <gender>]
    wav.scp     [<utterance-ID> <full_path_to_audio_file>]
    text        [<utterance-ID> <text_transcription>]
    utt2spk     [<utterance-ID> <speaker-id>]
    corpus.txt  [<text_transcription>]


Fig. 8.27: Files that need to be created for Kaldi 


    dad_4_4_2 four four two
    july_1_2_5 one two five
    july_6_8_3 six eight three
    # and so on...


Fig. 8.28: Sample from the transcript file 
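The mapping files can be produced from a single list of utterances. The sketch below is an illustration under stated assumptions: the function name and paths are hypothetical, the utterance ID doubles as the speaker ID because no speaker labels are available, and the lines are sorted by utterance ID since Kaldi's validation scripts expect sorted data files.

```python
def kaldi_data_lines(utterances):
    """Build wav.scp, text, and utt2spk lines from (utt_id, wav_path, transcript) tuples."""
    utterances = sorted(utterances)   # sort by utterance ID
    wav_scp = ["{} {}".format(u, path) for u, path, _ in utterances]
    text = ["{} {}".format(u, trans) for u, _, trans in utterances]
    # Assumption: no speaker labels, so each utterance is its own "speaker"
    utt2spk = ["{} {}".format(u, u) for u, _, _ in utterances]
    return wav_scp, text, utt2spk

utterances = [
    ("utt2", "/data/b.wav", "two"),
    ("utt1", "/data/a.wav", "one"),
]
wav_scp, text, utt2spk = kaldi_data_lines(utterances)
```

Writing each list to its file, one entry per line, gives the layout summarized in Fig. 8.27.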


These files allow the rest of the required files to be generated, such as the “lexicon.txt” file, which contains all the words from the dictionary with the phone transcript. Additionally, there are non-silence and silence phone files that provide ways to handle non-speech audio (Fig. 8.29).


    !SIL sil
    <UNK> spn
    eight ey t
    five f ay v
    four f ao r
    nine n ay n
    one hh w ah n
    one w ah n
    seven s eh v ah n
    six s ih k s
    three th r iy
    two t uw
    zero z ih r ow
    zero z iy r ow
Fig. 8.29: Sample from the “lexicon.txt” file 


Once these files are prepared, Kaldi scripts can be used to create the ARPA language model and vocabulary. With Sphinx, every term in the dictionary needed a lexical entry to be entered manually (we leveraged premade dictionaries and additional inference dictionaries to accomplish this). The CMU dictionary is also used in this case, except that in Kaldi a pre-trained model is used to estimate the pronunciations of OOV words.⁵ Once the dictionary is prepared, the lexical model is built, with the phonetic representation of each word in the dataset. An FST is then constructed from the transcripts and lexicon model and used to train the model.

The next step in the data preprocessing is to produce MFCCs for all of the training data. A file for each utterance is saved individually so that features do not need to be regenerated for each experiment. During this process, we also create two smaller datasets: a dataset of the 10k shortest utterances and a dataset of the 20k shortest utterances. These are used to build the earlier models. Once features are extracted, we can train our models.
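Selecting the shortest utterances amounts to sorting by duration. Kaldi ships its own scripts for building such subsets; the Python sketch below only illustrates the idea, with a hypothetical function name and made-up durations.

```python
def shortest_utterances(durations, n):
    """Return the IDs of the n shortest utterances; ties are broken by ID."""
    return sorted(durations, key=lambda u: (durations[u], u))[:n]

# Hypothetical utterance durations in seconds
durations = {"a": 3.2, "b": 1.1, "c": 2.0, "d": 1.1}
subset = shortest_utterances(durations, 2)
```

Short utterances contain fewer frames per transcript, which makes the first alignments easier for the early models to learn.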


8.7.4.2 Model Training 


Much of the model training is scripted in Kaldi and is straightforward after the data preparation is completed. The first model that is trained is an HMM-GMM model. It is trained for 40 epochs through the 10k shortest utterances. This model is then used to align the 20k utterances. After each model is trained, we rebuild the decoding graph and apply it to the test set. This model yields a WER of 52.06 on the validation set (Fig. 8.30).


    steps/train_mono.sh --boost-silence 1.25 --nj 20 --cmd "run.pl --mem 8G" \
        data/train_10kshort data/lang exp/mono || exit 1;
    (
        utils/mkgraph.sh data/lang_test exp/mono exp/mono/graph
        for testset in valid_dev; do
            steps/decode.sh --nj 20 --cmd "run.pl --mem 8G" exp/mono/graph \
                data/$testset exp/mono/decode_$testset
        done
    )&


Fig. 8.30: Train the monophone “mono” model with alignments on the 10k shortest 
utterances subset from the training data 


Next, we use the 20k alignments to train a new model incorporating the delta 
and double delta features. This model will also leverage triphones in the training 
process. The previous process will be performed again, with a separate training 
script, producing a model that achieves a WER of 25.06 (Fig. 8.31). 
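Delta and double-delta features approximate the first and second time derivatives of the MFCCs. The sketch below uses the standard regression formula with a window of two frames; it is an illustration rather than Kaldi's implementation, and the function names, window size, and edge padding are assumptions.

```python
import numpy as np

def deltas(features, window=2):
    """Regression-based time derivative of a (frames, dims) feature matrix."""
    T = len(features)
    # Repeat the first and last frames so every frame has full context
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, window + 1):
        ahead = padded[window + n: window + n + T]     # frames t + n
        behind = padded[window - n: window - n + T]    # frames t - n
        out += n * (ahead - behind)
    return out / denom

def add_deltas(mfcc):
    """Stack static, delta, and double-delta features side by side."""
    d = deltas(mfcc)
    dd = deltas(d)
    return np.hstack([mfcc, d, dd])

mfcc = np.zeros((50, 13))          # stand-in for 50 frames of 13-dim MFCCs
stacked = add_deltas(mfcc)
ramp_delta = deltas(np.arange(10.0)[:, None])
```

With 13-dimensional MFCCs the stacked features have 39 dimensions, and a feature that rises by one unit per frame has a delta of 1 away from the edges.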

⁵ Note: It is possible to add specific words to the lexicon by editing the lexicon-iv.txt file.


    steps/align_si.sh --boost-silence 1.25 --nj 10 --cmd "run.pl --mem 8G" \
        data/train_20k data/lang exp/mono exp/mono_ali_train_20k

    steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" \
        2000 10000 data/train_20k data/lang exp/mono_ali_train_20k exp/tri1

    # decode using the tri1 model
    (
        utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph
        for testset in valid_dev; do
            steps/decode.sh --nj 20 --cmd "$decode_cmd" exp/tri1/graph \
                data/$testset exp/tri1/decode_$testset
        done
    )&


Fig. 8.31: Train the triphone “tri1” model with alignments from the “mono” model on the 20k training subset


The third model that is trained is an LDA+MLLT model. This model will be used to compute better alignments using the learnings of the previous model on the 20k dataset. So far, we have been using the 13-dimensional MFCC features. In this model, multiple frames are considered in a single input to provide more context at each state. The added input dimensionality increases the computational requirements of the classifier, so we use linear discriminant analysis (LDA) to reduce the dimensionality of the features. Additionally, a maximum likelihood linear transform (MLLT) is applied to further decorrelate the features and make them “orthogonal” so that they are better modeled by diagonal-covariance Gaussians [Rat+13]. The resulting model yields a WER of 21.69 (Fig. 8.32).
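Considering multiple frames in a single input is usually called splicing. The sketch below stacks each frame with three frames of context on either side, matching the context options used in the training script; the function name and the edge padding are assumptions, and the LDA and MLLT transforms themselves are not shown.

```python
import numpy as np

def splice(features, left=3, right=3):
    """Concatenate each frame with its left and right context frames."""
    T = len(features)
    # Repeat the edge frames so the first and last frames have full context
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))   # stand-in for 100 frames of 13-dim MFCCs
spliced = splice(mfcc)
```

Seven frames of 13 dimensions give a 91-dimensional input per frame, which is why a dimensionality reduction step such as LDA is applied afterwards.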

The next model that is trained is the speaker adapted model referred to as 
“LDA+MLLT+SAT”. The 20k dataset is aligned again using the previous model, 
and using the same architecture as the previous model with the additional speaker 
adapted features. Because our data doesn’t include speaker tags, we would not ex- 
pect to get gains in this area, and we do not see any. The resulting model yields a 
WER of 22.25 (Fig. 8.33). 

We now apply the alignments computed with the previous model to the entire 
training dataset. We train another LDA+MLLT+SAT model on the new alignments. 
The resulting model gives a WER of 17.85 (Fig. 8.34). 

The final model is a TDNN model [PPK15]. The TDNN model is an 8-layer
DNN with batch normalization. This model requires a GPU to train due to its
depth and the need for parallelization (Fig. 8.35).

The final model, after integrating the 8-layer DNN, achieves a WER of 4.82 on
the validation set.
steps/align_si.sh --nj 10 --cmd "$train_cmd" \
  data/train_20k data/lang exp/tri1 exp/tri1_ali_train_20k

steps/train_lda_mllt.sh --cmd "$train_cmd" \
  --splice-opts "--left-context=3 --right-context=3" 2500 15000 \
  data/train_20k data/lang exp/tri1_ali_train_20k exp/tri2b

# decode using the LDA+MLLT model
utils/mkgraph.sh data/lang_test exp/tri2b exp/tri2b/graph
(
  for testset in valid_dev; do
    steps/decode.sh --nj 20 --cmd "$decode_cmd" exp/tri2b/graph \
      data/$testset exp/tri2b/decode_$testset
  done
)&


Fig. 8.32: Train the LDA+MLLT “tri2b” model with alignments from the “tri1”
model


# Align utts using the tri2b model
steps/align_si.sh --nj 10 --cmd "$train_cmd" --use-graphs true \
  data/train_20k data/lang exp/tri2b exp/tri2b_ali_train_20k

steps/train_sat.sh --cmd "$train_cmd" 2500 15000 \
  data/train_20k data/lang exp/tri2b_ali_train_20k exp/tri3b

# decode using the tri3b model
(
  utils/mkgraph.sh data/lang_test exp/tri3b exp/tri3b/graph
  for testset in valid_dev; do
    steps/decode_fmllr.sh --nj 10 --cmd "$decode_cmd" \
      exp/tri3b/graph data/$testset exp/tri3b/decode_$testset
  done
)&


Fig. 8.33: Train the LDA+MLLT+SAT “tri3b” model with alignments from the 
“tri2b” model 


8.7.5 Results 


The results achieved throughout this case study are summarized in Table 8.1. The
best Kaldi and Sphinx models are then evaluated on the test set; these results are
shown in Table 8.2. The addition of deep learning in the final Kaldi TDNN model
yields a significant quality improvement over the traditional learning algorithms
for the acoustic model.
# Align utts in the full training set using the tri3b model
steps/align_fmllr.sh --nj 20 --cmd "$train_cmd" \
  data/valid_train data/lang \
  exp/tri3b exp/tri3b_ali_valid_train

# train another LDA+MLLT+SAT system on the entire training set
steps/train_sat.sh --cmd "$train_cmd" 4200 40000 \
  data/valid_train data/lang \
  exp/tri3b_ali_valid_train exp/tri4b

# decode using the tri4b model
(
  utils/mkgraph.sh data/lang_test exp/tri4b exp/tri4b/graph
  for testset in valid_dev; do
    steps/decode_fmllr.sh --nj 20 --cmd "$decode_cmd" \
      exp/tri4b/graph data/$testset \
      exp/tri4b/decode_$testset
  done
)&


Fig. 8.34: Train the LDA+MLLT+SAT “tri4b” model with alignments from the 
“tri3b” model 


local/chain/run_tdnn.sh --stage 0


Fig. 8.35: Script: Train the TDNN model using the “tri4b” model


Table 8.1: Speech recognition performance on Common Voice validation set. Best
result is marked with *

Approach                                                    WER
Sphinx                                                      39.82
Kaldi monophone (10k sample)                                52.06
Kaldi triphone (with delta and double delta, 20k sample)    25.06
Kaldi LDA+MLLT (20k sample)                                 21.69
Kaldi LDA+MLLT+SAT (20k sample)                             22.25
Kaldi LDA+MLLT+SAT (all data)                               17.85
Kaldi TDNN (all data)                                        4.82 *


Table 8.2: Speech recognition performance on Common Voice test set. Best result
is marked with *

Approach       WER
Sphinx         53.85
Kaldi TDNN      4.44 *


8.7.6 Exercises for Readers and Practitioners


Some other interesting problems readers and practitioners can try on their own
include:


1. How are additional words added to the vocabulary?
2. Evaluate the real-time factor (RTF) for this system.
3. What are some ways to improve quality on accented speech?
4. How many states are in the set for a diphone model? How many for a triphone
   model?
Part III
Advanced Deep Learning Techniques
for Text and Speech


Chapter 9
Attention and Memory Augmented
Networks


9.1 Introduction 


As we have seen in the previous chapters, deep learning offers good architectures
for handling spatial and temporal data using various forms of convolutional and
recurrent networks, respectively. When a task requires out-of-order (unordered)
access to the data or involves long-term dependencies, most of the standard
architectures discussed so far are not suitable. Consider a specific example
from the bAbI dataset in which stories/facts are presented, a question is asked,
and the answer needs to be inferred from the stories. As shown in Fig. 9.1, finding
the right answer requires out-of-order access and long-term dependencies.

Deep learning architectures have made significant progress in the last decade in
capturing implicit knowledge as features in various NLP tasks. Many tasks, such
as question answering or summarization, also require explicit knowledge to be
stored so that it can be used later in the task. For example, in the bAbI dataset,
information about Mary, Sandra, John, the football, and its location must be captured
to answer a question such as “Where is the football?” Recurrent networks such as
LSTM and GRU cannot capture such information over very long sequences. Attention
mechanisms, memory augmented networks, and combinations of the two are currently
the leading techniques for addressing many of the issues discussed above. In this
chapter, we discuss in detail many popular attention mechanisms and memory
networks that have been successfully employed in speech and text.

Though the attention mechanism became very popular in NLP and speech after the
work of Bahdanau et al., it had previously been introduced in some forms in neural
architectures. Larochelle and Hinton highlight the usefulness of “fixation points”
for improving performance in image recognition tasks [LH10]. Denil et al. also
proposed a similar attention model for object tracking and recognition, inspired by
neuroscience models [Den+12]. Weston et al. pioneered the modern-day memory
augmented networks, but the origins trace back to the early 1960s and the work of
Steinbuch and Piske [SP63]. Das et al. used a neural network pushdown automaton
(NNPDA) with an external stack memory to address the difficulties recurrent
networks have in learning context-free grammars [DGS92]. Mozer, in his work on
complex time series, used an architecture with two separate parts: (a) a short-term
memory to capture past events and (b) an associator that uses the short-term memory
for classification or prediction [Moz94].


Stories/facts:  Mary moved to the bathroom.
                Sandra journeyed to the bedroom.
                Mary got the football there.
                John went to the kitchen.
                Mary went back to the kitchen.
                Mary went back to the garden.
Question:       Where is the football?

Fig. 9.1: Question and answer task


9.2 Attention Mechanism 


More generally, attention is a well-known concept in human psychology: humans,
limited by processing bottlenecks, selectively focus on certain parts of the available
information and ignore the rest. Mapping this concept to sequence data such as text
or audio streams, an attention mechanism focuses on certain parts or regions of a
sequence and blurs the remaining ones during learning. Attention was introduced in
Chap. 7 together with recurrent networks and sequence modeling. Since many
techniques using attention are either related to or used in memory augmented
networks, we will cover some of the modern techniques that have broad application.


9.2.1 The Need for Attention Mechanism 


Consider translating the English sentence “I like coffee” into French, which gives
“J’aime le café.” We will use this machine translation use case with
sequence-to-sequence models to highlight the need for the attention mechanism.
Consider a simple RNN-based encoder-decoder network as shown in Fig. 9.2. In this
neural machine translation model the entire sentence is compressed into a single
representation, the final encoder hidden vector $s_4$, which the decoder uses as its
input for the translation. As the length of the input sequence increases, encoding
all of the information in that single vector becomes infeasible. Text input normally
has a complex phrase structure and long-distance relationships between words, all of
which must be crammed into the single vector at the end. Also, in reality, all of the
hidden states of the encoder network carry information that can influence the decoder
output at any timestep. By using only the last hidden state rather than all of them,
their influence may be diluted. Finally, each output of the decoder may be influenced
differently by each of the inputs, and not necessarily in the same order as in the
input sequence.


[Figure labels: Word Embeddings (input “I like coffee <eos>”); Encoder Network;
Encoder hidden states; Decoder Network (output ending in “café <eos>”)]

Fig. 9.2: Encoder-decoder using RNN for neural machine translation
9.2.2 Soft Attention


In this section, we will introduce the attention mechanism as a way to overcome 
the issues with recurrent networks. We will start with the attention mechanism by 
Luong et al. which is more general and then describe how it differs from the original 
Bahdanau et al. attention-based paper [LPM15, BCB14b]. 


The attention mechanism involves the following quantities in the encoder and the
decoder (Fig. 9.3):

• The source sequence of length $n$, given by $x = \{x_1, x_2, \ldots, x_n\}$.
• The target sequence of length $m$, given by $y = \{y_1, y_2, \ldots, y_m\}$.
• The encoder hidden states $s_1, s_2, \ldots, s_n$.
• The decoder hidden state $h_i$ for the output at position $i = 1, 2, \ldots, m$.

The source-side context vector $c_i$ at position $i$ is the weighted average of the
encoder hidden states, with the weights given by the alignment vector $a_i$:

$$c_i = \sum_j a_{i,j}\, s_j \tag{9.1}$$

The alignment scores are given by:

$$\begin{aligned}
a_{i,j} &= \mathrm{align}(h_i, s_j) && (9.2)\\
        &= \mathrm{softmax}(\mathrm{score}(h_i, s_j)) && (9.3)
\end{aligned}$$


The $a_{i,j}$ are called the alignment weights. The equation above captures how every
input element can influence the output element at a given position. The predefined
function score is called the attention score function, and several variants of it
are defined in the next section.

The source-side context vector $c_i$ and the hidden state $h_i$ are combined using
concatenation $[c_i; h_i]$ and the non-linear tanh operation to give the attention
hidden vector $\tilde{h}_i$:

$$\tilde{h}_i = \tanh(W_c [c_i; h_i]) \tag{9.4}$$

where the weights $W_c$ are learned in the training process.

The attention hidden vector $\tilde{h}_i$ is passed through a softmax function to
generate the probability distribution given by:

$$P(y_i \mid y_{<i}, x) = \mathrm{softmax}(W_s \tilde{h}_i) \tag{9.5}$$


The attention mechanism of Bahdanau et al. differs from that of Luong et al. in
the following ways:

- Bahdanau et al. use a bidirectional recurrent encoder and concatenate the
  forward and backward hidden states.

- Bahdanau et al. use the previous decoder state, i.e., $h_{i-1}$, so the
  computational path is $h_{i-1} \to a_i \to c_i \to h_i$, whereas Luong et al. have
  $h_i \to a_i \to c_i \to \tilde{h}_i$.

- Bahdanau et al. use a linear combination of the previous state and the encoder
  states in the scoring function, given by
  $\mathrm{score}(s_j, h_i) = v_a^\top \tanh(W_a s_j + U_a h_i)$.

- Luong et al. have an input-feeding mechanism in which the attentional hidden
  vectors $\tilde{h}_i$ are concatenated with the target input at the next time step.
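
The Luong-style computation in Eqs. 9.1 to 9.4 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the dot-product score, uses Python lists for vectors, and the function name `soft_attention` and all numeric values are made up for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_attention(h_i, encoder_states, W_c):
    """One decoder step of soft attention, assuming a dot-product score."""
    scores = [dot(h_i, s_j) for s_j in encoder_states]       # score(h_i, s_j)
    a_i = softmax(scores)                                     # Eqs. 9.2-9.3
    dim = len(encoder_states[0])
    c_i = [sum(a * s[k] for a, s in zip(a_i, encoder_states))
           for k in range(dim)]                               # Eq. 9.1
    concat = c_i + h_i                                        # [c_i; h_i]
    h_tilde = [math.tanh(dot(row, concat)) for row in W_c]    # Eq. 9.4
    return a_i, c_i, h_tilde

# Toy values, chosen only for illustration.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_i = [2.0, 0.0]
W_c = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5]]
a_i, c_i, h_tilde = soft_attention(h_i, encoder_states, W_c)
```

Here the first and third encoder states receive equal and larger weights because they have the largest dot product with the decoder state.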






[Figure labels: Attention Layer; Context Vector $c_i$; Alignment weights $a_i$;
decoder states $h_{i-1}$ and $h_i$ (current decoder state)]

Fig. 9.3: Step-by-step computation process for soft attention in the encoder-decoder
network


9.2.3 Scores-Based Attention 


Table 9.1 lists different ways of computing the attention score function, each
giving a different flavor of attention.


Table 9.1: Attention score summary

Score name                    Score description                                                   Parameters                          References
Concat (additive)             $\mathrm{score}(s_j, h_i) = v_a^\top \tanh(W_a [s_j; h_i])$         $v_a$ and $W_a$ trainable           [LPM15]
Linear (additive)             $\mathrm{score}(s_j, h_i) = v_a^\top \tanh(W_a s_j + U_a h_i)$      $v_a$, $U_a$, and $W_a$ trainable   [BCB14b]
Bilinear (multiplicative)     $\mathrm{score}(s_j, h_i) = h_i^\top W_a s_j$                       $W_a$ trainable                     [LPM15]
Dot (multiplicative)          $\mathrm{score}(s_j, h_i) = h_i^\top s_j$                           No parameters                       [LPM15]
Scaled dot (multiplicative)   $\mathrm{score}(s_j, h_i) = \frac{h_i^\top s_j}{\sqrt{d}}$          No parameters                       [Vas+17c]
Location-based                $\mathrm{score}(s_j, h_i) = \mathrm{softmax}(W_a h_i^\top)$         $W_a$ trainable                     [LPM15]

In the scaled dot score, $d$ denotes the dimension of the hidden states.


• The multiplicative and additive score functions generally give similar results,
  but multiplicative score functions are faster and more space-efficient in
  practice because they can use efficient matrix multiplication techniques.

• Additive attention performs much better than the unscaled dot product when the
  input dimension is large. The scaled dot-product method defined above has been
  used to mitigate that issue in the general dot product.
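
To make the rows of Table 9.1 concrete, the sketch below writes most of the score functions as small Python functions over lists. The function names and the toy parameter values are assumptions for illustration; the scaled dot score here divides by the square root of the vector length, as in scaled dot-product attention.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def score_dot(s_j, h_i):
    return dot(h_i, s_j)

def score_scaled_dot(s_j, h_i):
    return dot(h_i, s_j) / math.sqrt(len(h_i))

def score_bilinear(s_j, h_i, W_a):
    return dot(h_i, matvec(W_a, s_j))

def score_linear(s_j, h_i, v_a, W_a, U_a):
    # v_a^T tanh(W_a s_j + U_a h_i)
    pre = [x + y for x, y in zip(matvec(W_a, s_j), matvec(U_a, h_i))]
    return dot(v_a, [math.tanh(p) for p in pre])

def score_concat(s_j, h_i, v_a, W_a):
    # v_a^T tanh(W_a [s_j; h_i]); list "+" concatenates the two vectors
    return dot(v_a, [math.tanh(p) for p in matvec(W_a, s_j + h_i)])

# Toy vectors and parameters, chosen only for illustration.
s_j, h_i = [1.0, 2.0], [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
W_concat = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
```

With these particular parameters the concat and linear scores coincide, which illustrates that the linear form is the concat form with the weight matrix split into two blocks.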


9.2.4 Soft vs. Hard Attention 


The only difference between soft attention and hard attention is that hard
attention picks one of the encoder states rather than taking a weighted average over
all the inputs as in soft attention. The hard attention is given by:

$$c_i = \underset{a_{i,j}}{\arg\max}\,\{s_1, s_2, \ldots, s_n\} \tag{9.6}$$

Thus the difference between hard attention and soft attention lies in the search
performed when the context is computed.


Hard attention uses the argmax function, which is neither continuous nor
differentiable and hence cannot be used with standard backpropagation.
Techniques such as reinforcement learning to select the discrete part
and Monte Carlo based sampling are often used instead. Another technique is to use
the Gaussian trick given in the next section.
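
The contrast between the two context computations can be seen on a toy example. The sketch below assumes the alignment weights are already computed; the helper names `soft_context` and `hard_context` are illustrative only.

```python
def soft_context(weights, states):
    """Weighted average of all encoder states (soft attention, Eq. 9.1)."""
    dim = len(states[0])
    return [sum(w * s[k] for w, s in zip(weights, states)) for k in range(dim)]

def hard_context(weights, states):
    """Pick the single encoder state with the largest weight (Eq. 9.6)."""
    best = max(range(len(weights)), key=lambda j: weights[j])
    return states[best]

# Toy alignment weights and encoder states.
weights = [0.1, 0.7, 0.2]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
soft = soft_context(weights, states)   # blends all three states
hard = hard_context(weights, states)   # copies the second state only
```

The soft context changes smoothly when the weights change, while the hard context jumps from one state to another, which is why it has no useful gradient.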


9.2.5 Local vs. Global Attention 


Soft attention methods such as that of Bahdanau et al. are also referred to as
global attention, as each decoder state takes “all” of the encoder inputs into
account while computing the context vector. Iterating over all the inputs can be
computationally expensive and often impractical when the sequence length is
large.

Luong et al. introduced local attention, a combination of soft attention and
hard attention, to overcome these issues [Luo+15]. One way to achieve local
attention is to use only a small window of the encoder hidden states for computing
the context. This is called predictive alignment, and it restores differentiability.

At any decoder state for time $i$, the network generates an aligned position $p_i$,
and a window of size $D$ on either side of that position, i.e.,
$[p_i - D, p_i + D]$, is used to compute the context vector $c_i$. The position $p_i$
is a scalar computed using a sigmoid function on the current decoder hidden state
$h_i$ and the sentence length $S$:

$$p_i = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_i)) \tag{9.7}$$

where $W_p$ and $v_p$ are the model parameters learned to predict the position,
$S$ is the length of the source sequence, and $p_i \in [0, S]$. The difficulty is in
how to focus around the location $p_i$ without using the non-differentiable argmax.
One way to focus the alignment near $p_i$ is to center a Gaussian distribution on
$p_i$ with standard deviation $\sigma = D/2$:

$$a_{i,j} = \mathrm{align}(s_j, h_i)\, \exp\!\left(-\frac{(j - p_i)^2}{2\sigma^2}\right) \tag{9.8}$$

The schematic is shown in Fig. 9.4.
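
A small sketch of predictive local attention follows. It is written under several assumptions that are not in the text: the align function is a softmax over dot-product scores inside the window, source positions are indexed from 0, and the parameter values are toy numbers picked so that the predicted position lands in the middle of the sequence.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_attention(h_i, encoder_states, v_p, W_p, D):
    """Predict a position p_i (Eq. 9.7) and weight a window around it (Eq. 9.8)."""
    S = len(encoder_states)
    hidden = [math.tanh(dot(row, h_i)) for row in W_p]
    p_i = S / (1.0 + math.exp(-dot(v_p, hidden)))             # S * sigmoid(...)
    sigma = D / 2.0
    window = [j for j in range(S) if p_i - D <= j <= p_i + D]
    # Assumed align(): softmax of dot-product scores within the window only.
    align = softmax([dot(h_i, encoder_states[j]) for j in window])
    weights = [0.0] * S
    for a, j in zip(align, window):
        weights[j] = a * math.exp(-((j - p_i) ** 2) / (2.0 * sigma ** 2))
    return p_i, weights

# Toy setup: 8 identical encoder states, so only the Gaussian shapes the weights.
encoder_states = [[1.0, 1.0] for _ in range(8)]
h_i = [0.5, -0.5]
v_p = [0.0]
W_p = [[0.0, 0.0]]
p_i, weights = local_attention(h_i, encoder_states, v_p, W_p, D=2)
```

Positions outside the window get zero weight, and inside the window the weights fall off symmetrically around the predicted position.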


Fig. 9.4: Step-by-step computation process for local attention in the encoder-decoder
network


9.2.6 Self-Attention


Lin et al. introduced the concept of self-attention or intra-attention, where the
premise is that by allowing a sentence to attend to itself many relevant aspects can
be extracted [Lin+17]. Additive attention is used to compute the score for each
hidden state $h_i$:

$$\mathrm{score}(h_i) = v_a^\top \tanh(W_a h_i) \tag{9.9}$$

Then, using all the hidden states $H = \{h_1, \ldots, h_n\}$, stacked as the rows of
a matrix, the attention vector $a$ is:

$$a = \mathrm{softmax}(v_a \tanh(W_a H^\top)) \tag{9.10}$$

where $W_a$ and $v_a$ are a weight matrix and a weight vector learned on the
training data. The final sentence vector $c$ is computed by:

$$c = aH \tag{9.11}$$

Instead of just using the single vector $v_a$, several hops of attention are
performed by using a matrix $V_a$, which captures multiple relationships existing in
the sentence and allows us to extract an attention matrix $A$ as:

$$A = \mathrm{softmax}(V_a \tanh(W_a H^\top)) \tag{9.12}$$

$$C = AH \tag{9.13}$$

To encourage diversity and penalize redundancy in the attention vectors, we use the
following orthogonality constraint as a regularization technique:

$$\Omega = \|AA^\top - I\|_F^2 \tag{9.14}$$
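
The multi-hop computation and the redundancy penalty can be checked on small matrices. The sketch below assumes the hidden states are the rows of $H$, applies the softmax to each row of the score matrix, and uses the squared Frobenius norm for the penalty; the matrix sizes and values are arbitrary toy choices.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def self_attention(H, W_a, V_a):
    """H: n x d, W_a: d_a x d, V_a: r x d_a. Returns A (r x n) and C = AH (r x d)."""
    hidden = [[math.tanh(x) for x in row] for row in matmul(W_a, transpose(H))]
    A = [softmax(row) for row in matmul(V_a, hidden)]        # Eq. 9.12
    C = matmul(A, H)                                         # Eq. 9.13
    return A, C

def redundancy_penalty(A):
    """Squared Frobenius norm of (A A^T - I), Eq. 9.14."""
    G = matmul(A, transpose(A))
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(len(G)) for j in range(len(G)))

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # n = 3 hidden states, d = 2
W_a = [[1.0, -1.0], [0.5, 0.5]]             # d_a = 2
V_a = [[1.0, 0.0], [0.0, 1.0]]              # r = 2 attention hops
A, C = self_attention(H, W_a, V_a)
penalty = redundancy_penalty(A)
```

The penalty is zero when each hop puts all of its weight on a different position and grows when two hops attend to the same positions.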


9.2.7 Key-Value Attention 


Key-value attention by Daniluk et al. is another variant, which splits the hidden
layer into a key and a value: the keys are used for the attention distribution and
the values for the context representation [Dan+17]. The hidden vector $h_i$ is split
into a key $k_i$ and a value $v_i$: $[k_i; v_i] = h_i$. The attention vector $a_i$ of
length $L$ is given by:

$$a_i = \mathrm{softmax}(v_a \tanh(W_1 [k_{i-L}; \ldots; k_{i-1}] + W_2 \mathbf{1}^\top)) \tag{9.15}$$

where $v_a$, $W_1$, $W_2$ are the parameters. The context is then represented as:

$$c_i = [v_{i-L}; \ldots; v_{i-1}]\, a_i^\top \tag{9.16}$$


9.2.8 Multi-Head Self-Attention 


Vaswani et al. propose the transformer network, which uses multi-head
self-attention without any recurrent networks to achieve state-of-the-art results in
machine translation [Vas+17c]. We describe multi-head self-attention step by step
in this section, as illustrated in Fig. 9.5.

Fig. 9.5: Multi-head self-attention step-by-step computation

The source input with words $\mathrm{word}_1, \mathrm{word}_2, \ldots, \mathrm{word}_n$
is first mapped through an embedding layer to get the word vectors
$x_1, x_2, \ldots, x_n$. Three matrices $W^Q$, $W^K$, and $W^V$, called the query,
key, and value weight matrices, are learned during training. The word embedding
vectors are multiplied by $W^Q$, $W^K$, and $W^V$ to get the query, key, and value
vectors $q$, $k$, and $v$, respectively, for each word. Next, a score is calculated
for each word against every word in the sentence by taking the dot product of its
query vector $q$ with the key vector $k$ of every word. Each score captures a single
interaction between the word and one other word. For instance, for the first word:

$$\mathrm{score}_{1,j} = q_1 \cdot k_j, \quad j = 1, 2, \ldots, n$$

Each score is then divided by the square root of the dimension of the key vectors,
$\sqrt{d_k}$, and a softmax over $j$ is computed to get weights between 0 and 1.
These weights are multiplied with the corresponding value vectors to get the weighted
value vectors, which gives attention or focus to specific words in the sentence
rather than every word. The weighted value vectors are then summed to compute the
output attention vector for that word:

$$z_1 = \alpha_{1,1} v_1 + \alpha_{1,2} v_2 + \cdots + \alpha_{1,n} v_n, \quad \alpha_{1,j} = \mathrm{softmax}_j\!\left(\frac{\mathrm{score}_{1,j}}{\sqrt{d_k}}\right)$$

Instead of this step-by-step computation, the whole thing can be computed at once
by taking the sentence representation as the embedding matrix $X$ of all word vectors
and multiplying it with the respective weight matrices $W^Q$, $W^K$, and $W^V$ to get
the matrices $Q$, $K$, and $V$ for all the words, and then computing the attention as:

$$\mathrm{attention}(Q, K, V) = Z = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \tag{9.17}$$

Instead of using just one attention as computed above, they use multi-head
attention, where several such attention matrices $Z_0, Z_1, \ldots, Z_m$ are computed
for the input. These matrices are concatenated and multiplied by another weight
matrix $W^O$ to get the final attention $Z$.
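
The matrix form in Eq. 9.17 and the concatenation of heads can be sketched with plain Python lists. This is an illustrative sketch, not the transformer implementation: the helper names `attention_head` and `multi_head`, the identity weight matrices, and the two-word input are all assumptions chosen to keep the numbers easy to follow.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention_head(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head (Eq. 9.17)."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def multi_head(X, heads, W_o):
    """Concatenate the outputs of all heads per word, then multiply by W_o."""
    outputs = [attention_head(X, *head)[0] for head in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

# Toy input: two words with 2-dimensional embeddings and identity projections.
X = [[1.0, 0.0], [0.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z, weights = attention_head(X, I2, I2, I2)
heads = [(I2, I2, I2), (I2, I2, I2)]
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]   # (2 heads * 2) x 2
Z_multi = multi_head(X, heads, W_o)
```

Because both toy heads are identical and `W_o` simply adds them, the multi-head output here is twice the single-head output; in a trained model each head has its own projections and attends to different relationships.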


9.2.9 Hierarchical Attention 


Yang et al. used hierarchical attention for document classification, showing the
advantage of having attention mechanisms at the sentence level for context and at the
word level for importance [Yan+16]. As shown in Fig. 9.6, the overall idea is to
have word-level encoding using bidirectional GRUs, word-level attention,
sentence-level encoding, and sentence-level attention, arranged hierarchically. We
will briefly explain each of these components.

Consider the input as a set of documents, where each document has at most $L$
sentences and each sentence has at most $T$ words, such that $w_{it}$ represents the
$t$th word in the $i$th sentence of a document. Each word is mapped through an
embedding matrix $W_e$ to a vector $x_{it} = W_e w_{it}$, which then goes through a
bidirectional GRU:

$$x_{it} = W_e w_{it}, \quad t \in [1, T] \tag{9.18}$$
$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it}), \quad t \in [1, T] \tag{9.19}$$
$$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it}), \quad t \in [T, 1] \tag{9.20}$$

The hidden state for the word $w_{it}$ is obtained by concatenating the two vectors
from above, $h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]$, thus
summarizing all the information around it.

The word annotation $h_{it}$ is first fed to a one-layer MLP to get the hidden
representation $u_{it}$. The importance of the word is then measured as the
similarity of $u_{it}$ with a word-level context vector $u_w$ and normalized through
a softmax to give the weight $\alpha_{it}$. The sentence vector $s_i$ is the sum of
the annotations weighted by $\alpha_{it}$. The context vector $u_w$ is initialized
randomly and then learned in the training process. According to the authors, the
intuition behind the context vector $u_w$ is that it captures a fixed query like
“what is the informative word” in the sentence.

$$u_{it} = \tanh(W_w h_{it} + b_w) \tag{9.21}$$

where $W_w, b_w$ are the parameters learned from the training process.

$$\alpha_{it} = \frac{\exp(u_{it}^\top u_w)}{\sum_t \exp(u_{it}^\top u_w)} \tag{9.22}$$

$$s_i = \sum_t \alpha_{it} h_{it} \tag{9.23}$$


Given the $L$ sentence vectors $s_i$, the document hidden vectors are computed in the
same way as for words, using bidirectional GRUs:

$$\overrightarrow{h}_i = \overrightarrow{\mathrm{GRU}}(s_i), \quad i \in [1, L] \tag{9.24}$$
$$\overleftarrow{h}_i = \overleftarrow{\mathrm{GRU}}(s_i), \quad i \in [L, 1] \tag{9.25}$$

As with the word annotations, concatenating the two vectors,
$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, summarizes the sentences from
both directions. The sentence context vector $u_s$ is used in the same way as the
word context vector $u_w$ to obtain attention over the sentences and produce a
document vector $v$:

$$u_i = \tanh(W_s h_i + b_s) \tag{9.26}$$

where $W_s, b_s$ are the parameters learned from the training process.

$$\alpha_i = \frac{\exp(u_i^\top u_s)}{\sum_i \exp(u_i^\top u_s)} \tag{9.27}$$

$$v = \sum_i \alpha_i h_i \tag{9.28}$$

The document vector $v$ goes through a softmax for classification, and the negative
log-likelihood of the correct label is used as the training loss.
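
The same attention pooling step is used at the word level (Eqs. 9.21 to 9.23) and at the sentence level (Eqs. 9.26 to 9.28). The sketch below shows that shared step only. It assumes the annotations are already available (in the model they come from bidirectional GRUs), and for brevity it skips the sentence-level GRU and pools the sentence vectors directly; the function name `attention_pool` and all values are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_pool(annotations, W, b, context):
    """u = tanh(W h + b); alpha = softmax(u . context); return alpha and sum(alpha * h)."""
    u = [[math.tanh(dot(row, h) + b_k) for row, b_k in zip(W, b)]
         for h in annotations]
    alpha = softmax([dot(u_t, context) for u_t in u])
    dim = len(annotations[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, annotations)) for k in range(dim)]
    return alpha, pooled

# Toy document: 2 sentences, each a list of word annotations h_it
# (assumed here; in the model these are bidirectional GRU outputs).
doc = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
W_w, b_w, u_w = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0]
W_s, b_s, u_s = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.0, 1.0]

word_alpha, _ = attention_pool(doc[0], W_w, b_w, u_w)
sentence_vectors = [attention_pool(words, W_w, b_w, u_w)[1] for words in doc]
# Sentence-level GRU omitted in this sketch: pool the sentence vectors directly.
alpha_s, v = attention_pool(sentence_vectors, W_s, b_s, u_s)
```

In the first sentence, the words whose annotations align with the context vector `u_w` receive the larger weights, which is the sense in which the context vector acts as a fixed query.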


In practice, if there is a document classification task, hierarchical attention be- 
comes a good choice compared to other attention mechanisms or even other 
classification techniques. It helps to find both important keywords in the sen- 
tences and important sentences in the document through the learning process. 
Fig. 9.6: Hierarchical attention used in document classification


9.2.10 Applications of Attention Mechanism in Text and Speech 


Many NLP and NLU studies have used attention mechanisms for tasks such as sentence
embedding, language modeling, machine translation, syntactic constituency
parsing, document classification, sentiment classification, summarization, and
dialog systems, to name a few. Lin et al.’s self-attention using LSTM for sentence
embedding showed significant improvements over other embeddings on a variety
of tasks such as sentiment classification and textual entailment [Lin+17]. Daniluk
et al. applied attention mechanisms to language modeling and showed results
comparable to memory augmented networks [Dan+17]. The neural machine translation
implementations discussed in this chapter have achieved state-of-the-art
results [BCB14b, LPM15, Vas+17c]. Vinyals et al. showed that attention mechanisms
for syntactic constituency parsing could not only attain state-of-the-art results but
also improve on speed [Vin+15a]. Yang et al.’s research showed that hierarchical
attention can outperform many CNN and LSTM based networks by a large
margin [Yan+16]. Wang et al. showed that an attention-based LSTM could achieve
state-of-the-art results in aspect-level sentiment classification [Wan+16b]. Rush
et al. showed how local attention methods could give significant improvements in
the text summarization task [RCW15].

Chorowski et al. showed how attention mechanisms can achieve better
normalization for smoother alignments and how previous alignments can be used for
generating features in speech recognition [Cho+15b]. Bahdanau et al. used
end-to-end attention-based networks for large vocabulary speech recognition
problems [Bah+16b]. Listen, attend, and spell (LAS), an attention-based model, has
been shown to outperform the sequence-to-sequence approach [Cha+16a]. Zhang et
al. showed how using attention mechanisms with convolutional
networks can achieve state-of-the-art results in the speech emotion recognition
problem [Zha+18].


9.3 Memory Augmented Networks 


Next, we will describe some well-known memory augmented networks that have 
been very effective in NLP and speech research. 


9.3.1 Memory Networks 


Memory networks (MemNN) by Weston et al. were motivated by the ability to store
information coming from stories or knowledge base facts so that various questions
pertaining to these can be easily answered [WCB14]. Memory networks have been
extended in many ways for various other applications, but question answering over
stories or facts can be considered their basic application, which we will
focus on in our narrative.

A memory network consists of a memory $m$, an array of slots indexed as $m_i$, and
four components, as shown in Fig. 9.7:


1. Input Feature Map I: This component converts the incoming data to the internal
   feature representation. It can do any task-specific preprocessing,
   such as converting the text to embeddings or POS representations, to name a
   few. Given an input $x$, the goal is to convert it to an internal feature
   representation $I(x)$.

2. Generalization G: This component uses the representation of the input from
   above and updates the memory, applying a transformation if necessary. The
   transformation can be as simple as storing the representation as is, or it can
   involve coreference resolution or complex task-specific reasoning. In its
   simplest form, the transformation is given by:

   $$m_{H(x)} = I(x) \tag{9.29}$$

   Here $H(\cdot)$ is a general function that selects the slot to update. It can
   range from the simplest choice, such as finding the next free slot index in the
   memory, to more complex ones, such as choosing which slot to replace or forget
   when the memory is full. Once the slot index is determined, G stores the input
   $I(x)$ in that location. In general, the update of the memories $m_i$ for a new
   input is given by:

   $$m_i = G(m_i, I(x), m) \quad \forall i \tag{9.30}$$

3. Output O: This is the “read” part of the memory, where the inference needed
   to identify the parts of memory relevant for generating the response happens.
   It can be represented as computing the output features given the new input and
   the memory as $o = O(I(x), m)$.

4. Response R: This component converts the output from memory into a
   representation that the outside world can understand. The decoding of the output
   features to give the final response can be written as $r = R(o)$.


[Figure labels: Input; Generalization; Output $O(I(x), m)$; Response; Memory]

Fig. 9.7: Memory networks


In the paper, the input component stores the sentence as is, for both the stories
and the questions. The memory write or generalization is also basic: it writes to
the next slot, i.e., $m_N = x$, $N = N + 1$. Most of the work is done in the output
and response components, O and R, of the network.

The output module finds the $k$ memories that best support the input, using a
scoring function $s_O$:

$$o_k = \underset{i=1,\ldots,N}{\arg\max}\; s_O(x, m_i) \tag{9.31}$$

The function $s_O$ scores the match between the input question or input fact/story
sentence and each existing memory slot. In the simplest case,
they choose $k = 2$ for the output inference. This can be represented as:

$$o_1 = O_1(x, m) = \underset{i=1,\ldots,N}{\arg\max}\; s_O(x, m_i) \tag{9.32}$$
$$o_2 = O_2(x, m) = \underset{i=1,\ldots,N}{\arg\max}\; s_O([x, m_{o_1}], m_i) \tag{9.33}$$

The input $[x, m_{o_1}, m_{o_2}]$ is given to the response component, which generates
the single word with the highest ranking:

$$r = \underset{w \in W}{\arg\max}\; s_R([x, m_{o_1}, m_{o_2}], w) \tag{9.34}$$

where $W$ is the set of all words in the vocabulary and $s_R$ is the scoring function
that matches words to the inputs. In the paper the scoring functions $s_O$ and $s_R$
have the same form and can be written as:

$$s(x, y) = \Phi_x(x)^\top U^\top U \Phi_y(y) \tag{9.35}$$

The matrix $U$ is $n \times D$ dimensional, where $n$ is the embedding size and $D$
is the number of features. The functions $\Phi_x$ and $\Phi_y$ map the original text
to the $D$-dimensional feature space. The feature space chosen in the paper was a
bag of words over the vocabulary $W$ with $D = 3|W|$ for both $s_O$ and $s_R$, i.e.,
every word has three representations, one for $\Phi_y(\cdot)$ and two for
$\Phi_x(\cdot)$, the latter depending on whether the word comes from the input or
from the supporting memories, so that the two cases can be modeled separately. The
parameters $U$ in $s_O$ and $s_R$ are separate and are trained using the margin
ranking loss given by:

$$\begin{aligned}
&\sum_{\bar{f} \neq m_{o_1}} \max(0, \gamma - s_O(x, m_{o_1}) + s_O(x, \bar{f})) \; + \\
&\sum_{\bar{f}' \neq m_{o_2}} \max(0, \gamma - s_O([x, m_{o_1}], m_{o_2}) + s_O([x, m_{o_1}], \bar{f}')) \; + \\
&\sum_{\bar{r} \neq r} \max(0, \gamma - s_R([x, m_{o_1}, m_{o_2}], r) + s_R([x, m_{o_1}, m_{o_2}], \bar{r}))
\end{aligned} \tag{9.36}$$

where $\bar{f}$, $\bar{f}'$, and $\bar{r}$ are choices other than the true label,
i.e., a margin loss is added whenever the score of a wrong choice is greater than
the score of the ground truth minus $\gamma$.
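
The scoring function of Eq. 9.35 and one sum of the margin loss in Eq. 9.36 can be sketched as follows. The feature vectors, the matrix $U$, and the choice of which slot is the true supporting fact are toy assumptions made for illustration; the function names are not from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def embed(U, phi):
    """Project a D-dimensional feature vector into the n-dimensional embedding space."""
    return [dot(row, phi) for row in U]

def score(U, phi_x, phi_y):
    """s(x, y) = Phi_x(x)^T U^T U Phi_y(y), computed as (U Phi_x) . (U Phi_y)."""
    return dot(embed(U, phi_x), embed(U, phi_y))

def margin_term(true_score, wrong_scores, gamma):
    """One sum of Eq. 9.36: hinge loss accumulated over all wrong candidates."""
    return sum(max(0.0, gamma - true_score + s) for s in wrong_scores)

# Toy sizes: embedding size n = 2, feature size D = 3.
U = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
phi_x = [1.0, 0.0, 1.0]                                   # features of the question
memories = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = [score(U, phi_x, m) for m in memories]
o1 = max(range(len(memories)), key=lambda i: scores[i])   # Eq. 9.32
# Slot 0 is assumed to be the true supporting fact for this illustration.
loss = margin_term(scores[0], scores[1:], gamma=0.1)
```

In this example the third memory scores as high as the true one, so it violates the margin and contributes the whole loss, while the clearly worse second memory contributes nothing.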


The searches for $o_1$ and $o_2$ given in Eqs. 9.32 and 9.33 can be computationally
expensive when the memory storage is large. The paper uses a couple of tricks,
such as hashing the words and clustering the word embeddings into $k$ clusters.
The clustering approach gives a nice trade-off between speed and
accuracy through the choice of the number of clusters $k$.


Let us take a simple example from the bAbI dataset with two supporting facts,
already stored in the memory slots given in the table below.


Memory slot ($m_i$)    Sentence
1                      Mary moved to the bathroom.
2                      Sandra journeyed to the bedroom.
3                      John went to the kitchen.
4                      Mary got the football there.
5                      Mary went back to the kitchen.
6                      Mary went back to the garden.


When the question “Where is the football?” is asked with $k = 2$, the input
$x$ = “Where is the football?” is matched against everything in memory, and the best
match is the slot $m_{o_1}$ = “Mary got the football there.” Using this, i.e.,
$[x, m_{o_1}]$, the network performs another similarity search and finds
$m_{o_2}$ = “Mary went back to the garden,” giving rise to the new output
$[x, m_{o_1}, m_{o_2}]$. The R component uses the input $[x, m_{o_1}, m_{o_2}]$ to
generate the output response $r$ = “garden.”


9.3.2 End-to-End Memory Networks 


To overcome issues of memory networks, such as the need to train each component
in a supervised manner and the difficulty of training hard attention,
Sukhbaatar et al. proposed end-to-end memory networks, or
MemN2N. MemN2N overcomes many of the disadvantages of MemNN by using
soft attention while reading from the memory, performing multiple lookups or hops
on the memory, and training end-to-end with backpropagation and minimal
supervision [Suk+15].


9.3.2.1 Single Layer MemN2N 


The MemN2N takes three inputs: (a) the story/facts/sentences $x_1, x_2, \ldots, x_n$,
(b) the query/question $q$, and (c) the answer/label $a$. We will next walk through
the different components and interactions of the MemN2N architecture, considering
only one layer of memory and the controller.


9.3.2.2 Input and Query 


The input sentences, where $x_i$ is the $i$th sentence with words
$x_{i1}, x_{i2}, \ldots, x_{in}$, are converted into memory representations
$m_1, m_2, \ldots, m_n$ of dimension $d$ using an embedding matrix $A$ of dimension
$d \times |V|$, where $|V|$ is the size of the vocabulary. The operation is given by:

$$m_i = \sum_j A x_{ij} \tag{9.37}$$

The paper discusses different ways of combining the word embeddings of all the words
in a sentence, for example, by summing all the word embeddings
to get a sentence embedding. Similarly, the query or question sentence is
mapped to a vector $u$ of dimension $d$ using an embedding matrix $B$ of dimension
$d \times |V|$.
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9.3.2.3 Controller and Memory 


The query representation $u$ from the embedding matrix $B$ is then matched by the controller against every memory vector $m_i$, using the dot product for similarity and a softmax to turn the scores into attention weights. The operation can be given by:

$$p_i = \mathrm{softmax}(u^\top m_i) = \frac{\exp(u^\top m_i)}{\sum_j \exp(u^\top m_j)} \tag{9.38}$$


9.3.2.4 Controller and Output 


Each input sentence $x_i$ is also mapped for the controller to a vector $c_i$ of dimension $d$ using a third embedding matrix $C$ of dimension $d \times |V|$. The output is then formed by combining the softmax outputs $p_i$ and the vectors $c_i$ as:

$$o = \sum_i p_i c_i \tag{9.39}$$


9.3.2.5 Final Prediction and Learning 


The output vector $o$ and the query embedding $u$ are combined and then passed through a final weight matrix $W$ and a softmax to produce the predicted label:

$$\hat{a} = \mathrm{softmax}(W(o+u)), \qquad \hat{a}_k = \frac{\exp\big([W(o+u)]_k\big)}{\sum_{k'} \exp\big([W(o+u)]_{k'}\big)} \tag{9.40}$$

The true label $a$ and the predicted label $\hat{a}$ are then used to train the network, including the embeddings $A$, $B$, $C$, and the matrix $W$, using cross-entropy loss and stochastic gradient descent. A single layer MemN2N with complete flow from input sentences, query, 
and answer is shown in Fig. 9.8. 
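
The single-layer forward pass of Eqs. 9.37 to 9.40 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names (`bow`, `memn2n_single_layer`), the toy story, the embedding size `d = 8`, and the random, untrained parameters are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def bow(tokens, vocab):
    """Bag-of-words count vector of length |V| (sum of one-hot word vectors)."""
    x = np.zeros(len(vocab))
    for tok in tokens:
        x[vocab[tok]] += 1.0
    return x

def memn2n_single_layer(story, question, A, B, C, W, vocab):
    X = np.stack([bow(s, vocab) for s in story])  # (n_sentences, |V|)
    m = X @ A.T                     # Eq. 9.37: m_i = sum_j A x_ij
    c = X @ C.T                     # output vectors c_i from matrix C
    u = B @ bow(question, vocab)    # query embedding u
    p = softmax(m @ u)              # Eq. 9.38: attention over memories
    o = p @ c                       # Eq. 9.39: weighted sum of the c_i
    a_hat = softmax(W @ (o + u))    # Eq. 9.40: distribution over answers
    return p, a_hat

# Toy data and random (untrained) parameters, for shape checking only.
story = [s.split() for s in ["mary got the football",
                             "mary went to the garden",
                             "john went to the kitchen"]]
question = "where is the football".split()
vocab = {w: i for i, w in enumerate(sorted({t for s in story + [question] for t in s}))}
rng = np.random.default_rng(0)
d = 8
A, B, C = (rng.normal(scale=0.1, size=(d, len(vocab))) for _ in range(3))
W = rng.normal(scale=0.1, size=(len(vocab), d))
p, a_hat = memn2n_single_layer(story, question, A, B, C, W, vocab)
```

With untrained parameters the attention `p` is close to uniform; training with cross-entropy on `a_hat` is what would push it toward the supporting sentences.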


9.3.2.6 Multiple Layers 


The single-layered MemN2N is then extended to multiple layers, as shown in Fig. 9.9, in the following way:


• Each layer has its own memory embedding matrix $A$ for the input and controller/output embedding matrix $C$.
• The input to layer $k+1$ combines the output $o^k$ of the current layer and its input $u^k$ using:
$$u^{k+1} = o^k + u^k \tag{9.41}$$
• The top layer uses the output with a softmax function in a similar way to generate a label $\hat{a}$.
• The final output $\hat{a}$ is similarly compared to the actual label $a$, and the entire network is trained using cross-entropy loss and stochastic gradient descent.

Fig. 9.8: Single-layered MemN2N


Since many tasks such as QA require temporal context, i.e., an entity was at some place before going to another place, the paper modifies the memory vector to encode the temporal context using a temporal matrix. For example, the input memory mapping can be written as:

$$m_i = \sum_j A x_{ij} + T_A(i) \tag{9.42}$$

where $T_A(i)$ is the $i$th row of a temporal matrix $T_A$.
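
Stacking hops (Eq. 9.41) and adding the temporal rows (Eq. 9.42) changes the single-layer computation only slightly, as the sketch below suggests. The function name `memn2n_forward`, the sizes, and the random parameters are illustrative assumptions; the matching temporal matrix `T_C` on the output side is also an assumption made by symmetry, since the text above only writes out the input-memory case.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memn2n_forward(X, q_bow, B, layers, W):
    """X: (n, |V|) bag-of-words sentences; layers: list of (A, C, T_A, T_C)."""
    u = B @ q_bow                      # initial controller state u^1
    for A, C, T_A, T_C in layers:      # one iteration per hop / layer
        m = X @ A.T + T_A              # Eq. 9.42: memory plus temporal row
        c = X @ C.T + T_C              # assumed analogous output-side encoding
        p = softmax(m @ u)             # attention for this hop
        o = p @ c                      # output of this hop
        u = o + u                      # Eq. 9.41: input to the next layer
    return softmax(W @ u)              # top layer: W(o^K + u^K), then softmax

rng = np.random.default_rng(0)
n, V, d, hops = 4, 10, 6, 3            # sentences, vocabulary, embedding, hops
X = rng.integers(0, 2, size=(n, V)).astype(float)
q_bow = rng.integers(0, 2, size=V).astype(float)
B = rng.normal(scale=0.1, size=(d, V))
shapes = [(d, V), (d, V), (n, d), (n, d)]   # A, C, T_A, T_C
layers = [tuple(rng.normal(scale=0.1, size=s) for s in shapes) for _ in range(hops)]
W = rng.normal(scale=0.1, size=(V, d))
a_hat = memn2n_forward(X, q_bow, B, layers, W)
```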


9.3.3 Neural Turing Machines 


Graves et al. proposed a memory-augmented network called the neural Turing machine (NTM) for performing complex tasks that are repetitive and need information over longer periods [GWD14b]. As shown in Fig. 9.10, it has a neural network component called the controller, which interacts with the outside world and the inner memory for all its operations. Drawing inspiration from Turing machines, the controller interacts with the memory using read heads and write heads. Since memory reads and writes can be seen as discrete, non-continuous operations, they cannot be differentiated, and thus most gradient-based algorithms cannot be used as is. One of the most important concepts introduced in the research was to overcome this issue by using blurry operations for both reading and writing that interact with all the memory elements to varying degrees. With these blurry operations, all reading and writing is continuous and differentiable and can be learned effectively using gradient-based algorithms such as stochastic gradient descent.

Fig. 9.9: Multiple layered MemN2N

Let us consider the memory $M_t$ at time $t$ to be a two-dimensional matrix of size $N \times M$, with $N$ rows corresponding to the memory locations and $M$ columns in each row where values get stored. Next, we will discuss the different operations in NTM.


9.3.3.1 Read Operations 


Attention is used to move the read (and write) heads in NTM. The attention mechanism can be written as a length-$N$ normalized weight vector $w_t$, reading contents from the memory $M_t$ at a given time $t$. Individual elements of this weight vector will be referred to as $w_t(i)$.

Fig. 9.10: Neural Turing machines

The constraints on the weight vector are:

$$\forall i \in \{1, \ldots, N\}: \quad 0 \le w_t(i) \le 1 \tag{9.43}$$

$$\sum_{i=1}^{N} w_t(i) = 1 \tag{9.44}$$

The read head returns a length-$M$ read vector $r_t$, which is a linear combination of the memory's rows scaled by the weight vector, as given by:

$$r_t \leftarrow \sum_{i=1}^{N} w_t(i) M_t(i) \tag{9.45}$$

Note that the sum runs over the $N$ memory rows. As the above equation is differentiable, the whole read operation is differentiable.


9.3.3.2 Write Operations 


Writing in NTM can be seen as two distinct steps: erasing the memory content and then adding new content. The erasing operation is done through a length-$M$ erase vector $e_t$, used in addition to the weight vector $w_t$, to specify which elements in the row should be completely erased, left unchanged, or partially changed. Thus the weight vector $w_t$ gives us a row to attend to, and the erase vector $e_t$ erases the elements in that row, giving the update:

$$M_t^{\text{erased}}(i) \leftarrow M_{t-1}(i)\,[\mathbf{1} - w_t(i)\, e_t] \tag{9.46}$$

After the erase step, i.e., after $M_{t-1}$ has been converted to $M_t^{\text{erased}}$, the write head uses a length-$M$ add vector $a_t$ to complete the writing, as given by:

$$M_t(i) \leftarrow M_t^{\text{erased}}(i) + w_t(i)\, a_t \tag{9.47}$$

Since both the erase and add operations are differentiable, the entire write operation is 
differentiable. 
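
A small NumPy sketch of the read (Eq. 9.45) and the erase-then-add write (Eqs. 9.46 and 9.47) is given below. The function names and the toy values are assumptions chosen so that the effect of each vector is easy to follow by hand.

```python
import numpy as np

def ntm_read(M, w):
    """Eq. 9.45: weighted sum of the N memory rows, giving a length-M vector."""
    return w @ M

def ntm_write(M_prev, w, e, a):
    """Erase (Eq. 9.46) followed by add (Eq. 9.47), applied to all rows at once."""
    M_erased = M_prev * (1.0 - np.outer(w, e))   # row i scaled by 1 - w(i) e
    return M_erased + np.outer(w, a)             # row i receives w(i) a

M = np.ones((3, 4))                   # N = 3 rows, M = 4 columns
w = np.array([0.0, 1.0, 0.0])         # attend fully to the second row
e = np.array([1.0, 1.0, 0.0, 0.0])    # erase the first two columns
a = np.array([5.0, 0.0, 0.0, 0.0])    # add 5 to the first column
M_new = ntm_write(M, w, e, a)
r = ntm_read(M_new, w)                # reads back the modified second row
```

Because `w` is one-hot here, only the second row changes; with a blurry `w` every row would be erased and written in proportion to its weight, which is what keeps the operation differentiable.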


9.3.3.3 Addressing Mechanism 


The weights used in reading and writing are computed based on two addressing mechanisms: (a) content-based addressing and (b) location-based addressing. The idea behind content-based addressing is to take information generated by the controller, even if it is partial, and find the closest match in the memory. In certain tasks, especially variable-based operations, it is imperative to find the location of the variables for operations such as iterations and jumps. In such cases, location-based addressing is very useful.

The weights are computed in different stages and passed on to the next stage. We will walk through every step in the process of computing the weights as given in Fig. 9.11. The first stage, known as content addressing, takes two inputs: a key vector $k_t$ of length $M$ and a scalar key strength $\beta_t$. The key vector $k_t$ is compared to every vector $M_t(i)$ using a similarity measure $K[\cdot,\cdot]$. The key strength $\beta_t$ acts as a weight that puts focus on certain terms or deemphasizes them. The content-based addressing produces the output $w_t^c$ as given by:

$$w_t^c(i) = \frac{\exp\big(\beta_t K[k_t, M_t(i)]\big)}{\sum_j \exp\big(\beta_t K[k_t, M_t(j)]\big)} \tag{9.48}$$

Location-based addressing is performed in the next three stages. The second stage, called interpolation, takes a scalar parameter $g_t \in (0,1)$ from the controller head, which is used to combine the content weight $w_t^c$ from the previous step and the previous time step's weight vector $w_{t-1}$ to generate the gated weighting $w_t^g$ given by:

$$w_t^g = g_t w_t^c + (1 - g_t)\, w_{t-1} \tag{9.49}$$


The next stage is a convolutional shift, which works to shift attention to other rows. It takes as input a shift vector $s_t$ from the controller head and the interpolated output $w_t^g$ of the previous stage. The shift vector can encode various shifts, such as +1 to shift forward one row, 0 to stay as is, and −1 to shift backward one row. The operation is a shift modulo $N$, so that an attention shift past the bottom moves the head to the top and vice versa. The shifted weighting is denoted by $\tilde{w}_t$ and the operation is:

$$\tilde{w}_t(i) = \sum_{j=0}^{N-1} w_t^g(j)\, s_t(i - j) \tag{9.50}$$

where the index $i - j$ is taken modulo $N$. The final stage is sharpening, which prevents the convolution-shifted weights from blurring, using another parameter $\gamma_t \ge 1$ from the controller head. The final output of the weight vector $w_t$ is given by:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \tag{9.51}$$


Thus the address for reading and writing is computed by the above operations, and all of the parts are differentiable and hence can be learned by gradient-based algorithms. The controller network offers many design choices, such as the type of neural network, the number of read heads, and the number of write heads. The paper used both feed-forward 
and LSTM-based recurrent neural networks for the controller. 
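
The four addressing stages (Eqs. 9.48 to 9.51) can be chained as in the following NumPy sketch. It assumes cosine similarity for $K[\cdot,\cdot]$ and represents the shift vector as a length-$N$ array whose entry `s[d]` is the weight given to a shift of `d` rows modulo $N$; both choices, like the function names, are assumptions made for illustration.

```python
import numpy as np

def cosine(k, M):
    """Cosine similarity between key k and every row of M."""
    return (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)

def ntm_address(M, w_prev, k, beta, g, s, gamma):
    # Stage 1, content addressing (Eq. 9.48): softmax of scaled similarities.
    z = beta * cosine(k, M)
    w_c = np.exp(z - z.max())
    w_c /= w_c.sum()
    # Stage 2, interpolation (Eq. 9.49): blend with the previous weighting.
    w_g = g * w_c + (1.0 - g) * w_prev
    # Stage 3, circular convolution (Eq. 9.50): s[d] weights a shift by d rows.
    N = len(w_g)
    w_s = np.zeros(N)
    for i in range(N):
        for j in range(N):
            w_s[i] += w_g[j] * s[(i - j) % N]
    # Stage 4, sharpening (Eq. 9.51): raise to gamma and renormalize.
    w = w_s ** gamma
    return w / w.sum()

M = np.eye(4)                                  # four orthogonal memory rows
shift_forward = np.array([0.0, 1.0, 0.0, 0.0]) # all weight on "shift by +1"
stay = np.array([1.0, 0.0, 0.0, 0.0])          # all weight on "no shift"

# Pure location-based step: g = 0 ignores content and moves the head forward.
w_loc = ntm_address(M, np.array([1.0, 0, 0, 0]), M[2], beta=1.0, g=0.0,
                    s=shift_forward, gamma=2.0)
# The shift wraps around: from the last row the head moves to the first.
w_wrap = ntm_address(M, np.array([0, 0, 0, 1.0]), M[2], beta=1.0, g=0.0,
                     s=shift_forward, gamma=2.0)
# Pure content-based step: g = 1 and a large beta focus on the matching row.
w_content = ntm_address(M, np.array([1.0, 0, 0, 0]), M[2], beta=50.0, g=1.0,
                        s=stay, gamma=1.0)
```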


9.3.4 Differentiable Neural Computer 


Graves et al. proposed the differentiable neural computer (DNC) as an extension and improvement over the neural Turing machine [Gra+16]. It follows the same high-level architecture of a controller with multiple read heads and a single write head affecting the memory, as given in Fig. 9.12. We will describe the changes that DNC makes to NTM in this section.

Fig. 9.12: DNC addressing scheme


9.3.4.1 Input and Outputs 


The controller network receives an input vector $x_t \in \mathbb{R}^X$ at every time step and generates an output $y_t \in \mathbb{R}^Y$. It also receives as input $R$ read vectors from the previous time step, $r_{t-1}^1, \ldots, r_{t-1}^R$, read from the memory matrix $M_{t-1} \in \mathbb{R}^{N \times M}$ via the read heads. The input and the read vectors are concatenated into a single controller input $x_{\text{cont}} = [x_t; r_{t-1}^1; \ldots; r_{t-1}^R]$. The controller uses a neural network such as an LSTM.


9.3.4.2 Memory Reads and Writes 


Location selection happens using weight vectors that are non-negative and sum to at most 1. The complete set of “allowed” weightings over $N$ locations in the memory is given by the non-negative orthant with the constraints:

$$\Delta_N = \left\{ \alpha \in \mathbb{R}^N \;\middle|\; \alpha_i \in [0,1],\ \sum_{i=1}^{N} \alpha_i \le 1 \right\} \tag{9.52}$$

The read operation is carried out using $R$ read weightings $\{w_t^{r,1}, \ldots, w_t^{r,R}\} \in \Delta_N$, thus giving the read vectors $\{r_t^1, \ldots, r_t^R\}$ by the equation:

$$r_t^i = M_t^\top w_t^{r,i} \tag{9.53}$$

The read vectors get appended to the controller input at the next time step.

The write operation is carried out by a write weighting $w_t^w \in \mathbb{R}^N$ together with a write vector $v_t \in \mathbb{R}^M$ and an erase vector $e_t \in [0,1]^M$, both emitted by the controller, to modify the memory as:

$$M_t = M_{t-1} \circ (E - w_t^w e_t^\top) + w_t^w v_t^\top \tag{9.54}$$

where $\circ$ represents element-wise multiplication and $E$ is an $N \times M$ matrix of ones.
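
In matrix form, the write of Eq. 9.54 and the reads of Eq. 9.53 take one line each. The sketch below uses assumed function names and toy values; note that the second read weighting sums to less than 1, which the constraint in Eq. 9.52 allows.

```python
import numpy as np

def dnc_write(M_prev, w_w, e, v):
    """Eq. 9.54: element-wise erase with outer(w_w, e), then add outer(w_w, v)."""
    E = np.ones_like(M_prev)
    return M_prev * (E - np.outer(w_w, e)) + np.outer(w_w, v)

def dnc_read(M, read_weights):
    """Eq. 9.53: one read vector M^T w per read head."""
    return [M.T @ w for w in read_weights]

M0 = np.zeros((3, 2))                         # N = 3 locations of width 2
w_w = np.array([0.5, 0.5, 0.0])               # write split over two locations
M1 = dnc_write(M0, w_w, e=np.ones(2), v=np.array([2.0, 4.0]))
reads = dnc_read(M1, [np.array([1.0, 0.0, 0.0]),
                      np.array([0.0, 0.5, 0.0])])   # second head: sum is 0.5
```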


9.3.4.3 Selective Attention 


The weightings from controller outputs are parameterized over the memory rows with three forms of attention mechanisms: content-based, memory allocation, and temporal order. The controller interpolates among these three mechanisms using scalar gates.

Similar to NTM, selective attention uses a partial key vector $k_t$ of length $M$ and a scalar key strength $\beta_t$. The key vector $k_t$ is compared to every vector $M_t[i]$ using a similarity measure $K[\cdot,\cdot]$, normally cosine similarity, to find the closest to the key, as given by:

$$C(M, k, \beta)[i] = \frac{\exp\big(\beta K[k, M[i]]\big)}{\sum_j \exp\big(\beta K[k, M[j]]\big)} \tag{9.55}$$

NTM's drawback of allocating only contiguous blocks of memory is also overcome in DNC. DNC defines the concept of a differentiable free list for tracking the usage ($u_t$) of every memory location. Usage is increased after each write ($w_t^w$) and optionally decreased after each read ($w_t^{r,i}$) by free gates ($f_t^i$), given by:

$$u_t = \big(u_{t-1} + w_{t-1}^w - u_{t-1} \circ w_{t-1}^w\big) \circ \prod_{i=1}^{R} \big(1 - f_t^i\, w_{t-1}^{r,i}\big) \tag{9.56}$$

The controller uses an allocation gate ($g_t^a \in [0,1]$) to interpolate between writing to the newly allocated location in the memory ($a_t$) or to an existing location found by content ($c_t^w$), with $g_t^w \in [0,1]$ being the write gate:

$$w_t^w = g_t^w \big[ g_t^a a_t + (1 - g_t^a)\, c_t^w \big] \tag{9.57}$$


Another drawback of NTM was the inability to retrieve memories while preserving temporal order, which is very important in many tasks. DNC overcomes this by having the ability to iterate through memories in the order they were written. A precedence weighting ($p_t$) keeps track of which memory locations were written to most recently using:

$$p_t = \Big(1 - \sum_i w_t^w[i]\Big)\, p_{t-1} + w_t^w \tag{9.58}$$

A temporal link matrix $L_t \in \mathbb{R}^{N \times N}$ has entries $L_t[i,j]$ representing the degree to which location $i$ was the location written after location $j$. The matrix gets updated using the precedence weight vector $p_{t-1}$ as given by:

$$L_t[i,j] = \big(1 - w_t^w[i] - w_t^w[j]\big)\, L_{t-1}[i,j] + w_t^w[i]\, p_{t-1}[j] \tag{9.59}$$

The controller can use the temporal link matrix to retrieve the write before ($b_t^i$) or after ($f_t^i$) the last read location ($w_{t-1}^{r,i}$), allowing forward and backward movement in time, given by the following equations:

$$b_t^i = L_t^\top w_{t-1}^{r,i} \tag{9.60}$$

$$f_t^i = L_t\, w_{t-1}^{r,i} \tag{9.61}$$
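
The bookkeeping in Eqs. 9.58 to 9.61 is easiest to see with one-hot write weightings. In the sketch below, three writes go to locations 0, 2, and 1 in that order, after which the link matrix can be followed forward or backward. The function name is an assumption, and zeroing the diagonal (no self-links) is a detail taken from the original DNC paper rather than from the text above.

```python
import numpy as np

def update_links(L_prev, p_prev, w_w):
    """Update the temporal link matrix (Eq. 9.59) and precedence (Eq. 9.58)."""
    L = (1.0 - w_w[:, None] - w_w[None, :]) * L_prev + np.outer(w_w, p_prev)
    np.fill_diagonal(L, 0.0)                 # assumption: exclude self-links
    p = (1.0 - w_w.sum()) * p_prev + w_w
    return L, p

N = 3
L, p = np.zeros((N, N)), np.zeros(N)
for loc in [0, 2, 1]:                        # write order: 0, then 2, then 1
    L, p = update_links(L, p, np.eye(N)[loc])

last_read = np.eye(N)[0]                     # previous read was at location 0
f = L @ last_read                            # Eq. 9.61: what was written next
f2 = L @ f                                   # follow the chain one more step
b = L.T @ np.eye(N)[1]                       # Eq. 9.60: what preceded location 1
```

Starting from location 0, the forward weighting points to location 2 and then to location 1, reproducing the write order; the backward weighting from location 1 points back to location 2.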


In the paper, the temporal link matrix is $N \times N$, and thus the memory and computation costs of the related operations are of the order $O(N^2)$. Since the matrix is sparse, the authors approximate it by keeping only a fixed number $K$ of entries of the vectors $w_t^w$ and $p_{t-1}$, the write weighting and the precedence weighting. These sparse vectors are then used to compute an approximate temporal link matrix $L_t$ and thus the new forward and backward weightings $f_t^i$ and $b_t^i$, respectively. They saw faster performance without any noticeable degradation in effectiveness using the approximate method.

The read head $i$ computes the content weight vector $c_t^{r,i}$ from the read key $k_t^{r,i}$ using:

$$c_t^{r,i} = C\big(M_t, k_t^{r,i}, \beta_t^{r,i}\big) \tag{9.62}$$

The read head takes a three-way gate ($\pi_t^i$) and uses it to interpolate among iterating backward, reading by content, and iterating forward, given by:

$$w_t^{r,i} = \pi_t^i[1]\, b_t^i + \pi_t^i[2]\, c_t^{r,i} + \pi_t^i[3]\, f_t^i \tag{9.63}$$


9.3.5 Dynamic Memory Networks 


Kumar et al. proposed dynamic memory networks (DMN), in which many NLP tasks are formulated as a triplet of facts, question, and answer, so that end-to-end learning can happen effectively [Kum+16]. We will describe the components of DMN as shown in Fig. 9.13 and use the small example given in that figure to explain each step.


9.3.5.1 Input Module 


The input module takes the stories/facts, etc. as sentences in raw form, transforms them into a distributed representation using embeddings such as GloVe from the memory module, and encodes it using a recurrent network such as a GRU. The input can be a single sentence or a list of sentences concatenated together, say of $T_I$ words given by $w_1, \ldots, w_{T_I}$. Every sentence is converted by adding an end-of-sentence token, and then the words are concatenated. Each end-of-sentence token generates a hidden state corresponding to that sentence, computed as $h_t = \mathrm{GRU}(L(w_t), h_{t-1})$, where $w_t$ is the index of the word at time $t$ and $L$ is the embedding matrix. The output of this input module is a sequence of $T_C$ facts, the hidden states at the end of each sentence, where $c_t$ is the fact at step $t$. In the simplest case, where each sentence is encoded as one fact, $T_C$ is equal to the number of sentences.

Fig. 9.13: Dynamic memory networks (DMN)


9.3.5.2 Question Module 


The question module is similar to the input module: the question sentence of $T_Q$ words is converted to embedding vectors and given to the recurrent network. A GRU-based recurrent network is used to model it, given by $q_t = \mathrm{GRU}(L(w_t^Q), q_{t-1})$, where $L$ is the embedding matrix. The output is the final hidden state at the end of the question, $q = q_{T_Q}$. The word embedding matrix $L$ is shared across both the input and the question module.


9.3.5.3 Episodic Memory Module


The hidden states of the input module across all the sentences and the question module's output are the inputs for the episodic memory module. Episodic memory has an attention mechanism to focus on the states from the input and a recurrent network which updates its episodic memory. The episodic memory updates are part of an iterative process. In each iteration $i$, the attention mechanism attends over the input module's hidden states, i.e., the fact representations $c$, together with the question $q$ and the past memory $m^{i-1}$, to produce an episode $e^i$. The episode is then used along with the previous memory $m^{i-1}$ to update the episodic memory, $m^i = \mathrm{GRU}(e^i, m^{i-1})$. The GRU is initialized with the question as the state, i.e., $m^0 = q$. The iterative nature of the episodic memory helps in focusing on different sections of the inputs and thus provides the transitive reasoning required for inference. The number of passes $T_M$ is a hyperparameter, and the final episodic memory $m^{T_M}$ is given to the answer module.

The attention mechanism has a feature generation part and a scoring part using a gating mechanism. The gating happens through a function $G$ that takes as input a candidate fact $c_t$, the previous memory $m^{i-1}$, and the question $q$ to compute a scalar $g_t^i$, which acts as a gate:

$$g_t^i = G(c_t, m^{i-1}, q) \tag{9.64}$$
The feature vector $z(c,m,q)$, which feeds into the scoring function $G$ above, is built from different similarities between the input fact, the previous memory, and the question, as given by:

$$z(c,m,q) = \big[\, c \circ m;\ c \circ q;\ |c - m|;\ |c - q| \,\big] \tag{9.65}$$

where $\circ$ is the element-wise product between the vectors. The scoring function $G$ is a standard two-layer feed-forward network:

$$G(c,m,q) = \sigma\big(W^{(2)} \tanh\big(W^{(1)} z(c,m,q) + b^{(1)}\big) + b^{(2)}\big) \tag{9.66}$$

where the weights $W^{(1)}, W^{(2)}$ and biases $b^{(1)}, b^{(2)}$ are learned through the training process. The episode at iteration $i$ runs the GRU over the sequence $c_1, \ldots, c_{T_C}$, weighted by the gates $g_t^i$, and the final hidden state is used as the episode, as given by:

$$h_t^i = g_t^i\, \mathrm{GRU}(c_t, h_{t-1}^i) + (1 - g_t^i)\, h_{t-1}^i \tag{9.67}$$

$$e^i = h_{T_C}^i \tag{9.68}$$

To stop the iteration, either a maximum number of iterations is set or a special end-of-passes token, learned with supervision, is passed to mark the end.
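
The gated episode computation (Eqs. 9.64 to 9.68) and the memory update can be sketched as follows. To keep the example short, a plain tanh recurrent cell stands in for the GRU, which is a deliberate simplification and not the DMN formulation; all parameter names, sizes, and the random untrained weights are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 4, 8                          # fact/question size, gate hidden size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Parameters of the two-layer scoring network G (Eq. 9.66).
W1, b1 = rng.normal(scale=0.5, size=(hidden, 4 * d)), np.zeros(hidden)
W2, b2 = rng.normal(scale=0.5, size=(1, hidden)), np.zeros(1)
# Parameters of the stand-in recurrent cells (episode and memory updates).
U_e, V_e = rng.normal(scale=0.5, size=(d, d)), rng.normal(scale=0.5, size=(d, d))
U_m, V_m = rng.normal(scale=0.5, size=(d, d)), rng.normal(scale=0.5, size=(d, d))

def features(c, m, q):
    """Eq. 9.65: similarity features between fact, memory, and question."""
    return np.concatenate([c * m, c * q, np.abs(c - m), np.abs(c - q)])

def gate(c, m, q):
    """Eqs. 9.64 and 9.66: scalar gate in (0, 1) for one fact."""
    return float(sigmoid(W2 @ np.tanh(W1 @ features(c, m, q) + b1) + b2)[0])

def cell(x, h, U, V):
    """Simplified recurrent cell used here in place of a GRU."""
    return np.tanh(U @ x + V @ h)

def episode(facts, m_prev, q):
    h = np.zeros(d)
    for c in facts:
        g = gate(c, m_prev, q)
        h = g * cell(c, h, U_e, V_e) + (1.0 - g) * h   # Eq. 9.67
    return h                                           # Eq. 9.68

facts = rng.normal(size=(6, d))           # T_C = 6 encoded sentences
q = rng.normal(size=d)
m = q.copy()                              # m^0 = q
for _ in range(2):                        # T_M = 2 passes over the facts
    e = episode(facts, m, q)
    m = cell(e, m, U_m, V_m)              # memory update m^i from e^i, m^{i-1}
gates = [gate(c, m, q) for c in facts]
```

A gate close to 0 leaves the running state untouched, so a fact that is irrelevant to the question and the current memory contributes nothing to the episode.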


9.3.5.4 Answer Module 


The answer module can be triggered either at the end of every episodic memory iteration or only after the final one, depending on the task. It is again modeled as a GRU whose inputs are the question, the last hidden state $a_{t-1}$, and the previous prediction $y_{t-1}$. The initial state $a_0$ is initialized to the last memory, $a_0 = m^{T_M}$. Thus the updates can be written as:

$$y_t = \mathrm{softmax}\big(W^{(a)} a_t\big) \tag{9.69}$$

$$a_t = \mathrm{GRU}\big([y_{t-1}, q],\ a_{t-1}\big) \tag{9.70}$$


9.3.5.5 Training 


The end-to-end training is done in a supervised manner, where the answer generated by the answer module is compared to the real labeled answer, and a cross-entropy loss is propagated back using stochastic gradient descent. To give a concrete example, let us consider the story with sentences $s_1, \ldots, s_6$ as inputs to the input module and the question $q$, Where is the football?, passed to the question module, as shown in Fig. 9.13. In the first pass of the episodic memory, let us assume it will attend to the word football from the question and to all the facts $c$ coming as hidden states from the input module; it will score all the facts where football appears and give the maximum score to facts such as Mary got the football there. In the next iteration, it will take the output from this episodic state and try to focus on the next part, Mary, and thus select all statements such as Mary moved to the bathroom, Mary got the football there, Mary went back to the kitchen, and Mary went back to the garden. From these, let us assume it will select the last sentence, Mary went back to the garden. The selection of the right sentences to focus on is learned in an end-to-end manner using backpropagation, where the actual label garden is compared with the output generated by the answer module to propagate the errors back.


9.3.6 Neural Stack, Queues, and Deques 


Grefenstette et al. explore learning interactions between the controller and memory using traditional data structures such as stacks, queues, and deques. These structures produce superior generalization when compared with RNNs [Gre+15]. In the next few sections, we will explore the basic working of the neural stack architecture and then generalize it to the others.


9.3.6.1 Neural Stack 


The neural stack is a differentiable structure which allows storing vectors through push operations and retrieving them through pop operations, analogous to the stack data structure, as shown in Fig. 9.14.

The entire stack content at a given time $t$ is denoted by the matrix $V_t$, where each row, corresponding to memory address $i$, contains a vector $v_i \in \mathbb{R}^m$. Associated with each index in the matrix is a strength giving the weight of the content at that index; the strengths are collected in the vector $s_t$. The push signal is given by a scalar $d_t \in (0,1)$ and the pop signal by a scalar $u_t \in (0,1)$. The value read from the stack is given by $r_t \in \mathbb{R}^m$.

The necessary operations for the neural stack are given by the following three equations for $V_t$, $s_t$, and $r_t$:

$$V_t[i] = \begin{cases} V_{t-1}[i] & \text{if } 1 \le i < t \\ v_t & \text{if } i = t \end{cases} \qquad \text{(note that } V_t[i] = v_i \text{ for all } i \le t\text{)} \tag{9.71}$$

Equation 9.71 captures the updates to the stack as an ever-growing list-like structure, where every old index keeps its value from the previous time step and the new vector is pushed on the top.

$$s_t[i] = \begin{cases} \max\Big(0,\ s_{t-1}[i] - \max\Big(0,\ u_t - \sum_{j=i+1}^{t-1} s_{t-1}[j]\Big)\Big) & \text{if } 1 \le i < t \\ d_t & \text{if } i = t \end{cases} \tag{9.72}$$

Equation 9.72 captures the strength updates, where the case $i = t$ means that we directly pass the push weight $d_t \in (0,1)$. Removing an entry from the stack doesn't remove it physically but sets the strength value at that index to 0. Each of the strengths lower down the stack changes based on the following calculation: take the pop signal strength $u_t$, subtract the sum of the strengths above that index, from $i+1$ up to $t-1$, and cap the result from below at 0. Then subtract this amount from the current strength $s_{t-1}[i]$ at the index and again cap the result from below at 0.

We look at Fig. 9.14 at time $t = 3$. For the lowest index $i = 1$, with a previous value of 0.7, the strength becomes $\max(0, 0.7 - \max(0, 0.9 - 0.5)) = 0.3$. Similarly, for the next index $i = 2$, with a previous value of 0.5 and no strengths above it, the strength becomes $\max(0, 0.5 - \max(0, 0.9 - 0)) = 0$. Finally, at $t = 3$, the top index $i = 3$ will have the $d_t$ value of 0.9.

$$r_t = \sum_{i=1}^{t} \min\Big(s_t[i],\ \max\Big(0,\ 1 - \sum_{j=i+1}^{t} s_t[j]\Big)\Big) \cdot V_t[i] \tag{9.73}$$

Equation 9.73 can be seen as the state that the network sees at time $t$. It is a combination of the stored vectors and their strengths, read from the top down, where the total strength read is capped at 1.

Again, when we look at Fig. 9.14 at time $t = 3$, the vectors are combined with their strengths as usual, except that the contribution of index 1 is reduced from 0.3 to 0.1, because substituting we get $\min(0.3, \max(0, 1 - 0.9)) = 0.1$.
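
The three stack equations translate directly into code. The sketch below replays the worked example above; the push and pop signals for the first two steps (0.8 and 0, then 0.5 and 0.1) are assumptions chosen to be consistent with the strengths 0.7 and 0.5 quoted in the text, and the stored vectors are one-hot so that the read vector shows the contribution of each entry.

```python
import numpy as np

def stack_step(V_prev, s_prev, v_t, d_t, u_t):
    """One neural-stack update; returns (V_t, s_t, r_t)."""
    # Eq. 9.71: keep the old rows and append the new vector on top.
    V_t = np.vstack([V_prev, v_t[None, :]])
    # Eq. 9.72: the pop signal consumes strength from the top down,
    # then the new entry receives the push strength d_t.
    s_t = np.zeros(len(s_prev) + 1)
    for i in range(len(s_prev)):
        above = s_prev[i + 1:].sum()
        s_t[i] = max(0.0, s_prev[i] - max(0.0, u_t - above))
    s_t[-1] = d_t
    # Eq. 9.73: read from the top down until a total strength of 1 is used.
    r_t = np.zeros(V_t.shape[1])
    for i in range(len(s_t)):
        above = s_t[i + 1:].sum()
        r_t += min(s_t[i], max(0.0, 1.0 - above)) * V_t[i]
    return V_t, s_t, r_t

m = 3
V, s = np.zeros((0, m)), np.zeros(0)       # empty stack
signals = [(0.8, 0.0), (0.5, 0.1), (0.9, 0.9)]   # (push d_t, pop u_t) per step
for t, (d_t, u_t) in enumerate(signals):
    V, s, r = stack_step(V, s, np.eye(m)[t], d_t, u_t)
```

After the third step the strengths are approximately (0.3, 0, 0.9) and the read vector is approximately (0.1, 0, 0.9), matching the values derived in the text.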


Fig. 9.14: Neural stack states with respect to time and operations of push and pop


9.3.6.2 Recurrent Networks, Controller, and Training


The gradual extension of the neural stack from above into a recurrent network and the controller actions are shown in Fig. 9.15a, b. The entire architecture marked with dotted lines is a recurrent network with inputs (a) the previous recurrent state $H_{t-1}$ and (b) the current input $i_t$, and outputs (a) the next recurrent state $H_t$ and (b) $o_t$. The previous recurrent state $H_{t-1}$ consists of three parts: (a) the previous state vector of the RNN, $h_{t-1}$, (b) the previous stack read $r_{t-1}$, and (c) the state of the stack from the previous step, $(V_{t-1}, s_{t-1})$. In the implementation, all of these vectors are set to 0 to start with, except $h_0$, which is randomly initialized.

Fig. 9.15: Neural stack with recurrent network and controller. (a) Neural stack as recurrent network. (b) Neural stack recurrent network with controller

The current input $i_t$ is concatenated with the previous read of the stack $r_{t-1}$ and fed to the controller, which has its own previous state $h_{t-1}$, generating the next state $h_t$ and the output $o'_t$. The output $o'_t$ yields the push signal scalar $d_t$, the pop signal scalar $u_t$, and the value vector $v_t$, which go as input signals to the neural stack, as well as the output signal $o_t$ for the whole network. The equations are:

$$d_t = \mathrm{sigmoid}(W_d\, o'_t + b_d) \tag{9.74}$$
$$u_t = \mathrm{sigmoid}(W_u\, o'_t + b_u) \tag{9.75}$$
$$v_t = \mathrm{sigmoid}(W_v\, o'_t + b_v) \tag{9.76}$$
$$o_t = \mathrm{sigmoid}(W_o\, o'_t + b_o) \tag{9.77}$$


The whole structure can be easily adapted to neural queues by changing the pop signal to read from the bottom of the list rather than the top, which can be written as:

$$s_t[i] = \begin{cases} \max\Big(0,\ s_{t-1}[i] - \max\Big(0,\ u_t - \sum_{j=1}^{i-1} s_{t-1}[j]\Big)\Big) & \text{if } 1 \le i < t \\ d_t & \text{if } i = t \end{cases} \tag{9.78}$$

$$r_t = \sum_{i=1}^{t} \min\Big(s_t[i],\ \max\Big(0,\ 1 - \sum_{j=1}^{i-1} s_t[j]\Big)\Big) \cdot V_t[i] \tag{9.79}$$

The neural deque works similarly to a neural stack, but has the ability to take the input signals of push, pop, and value for both the top and the bottom sides of the 
list. 
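
Compared with the stack, the queue equations (9.78 and 9.79) only change which strengths are summed: those below an index instead of those above it. A minimal sketch with assumed names and one-hot toy vectors shows the resulting first-in, first-out behavior.

```python
import numpy as np

def queue_step(V_prev, s_prev, v_t, d_t, u_t):
    """One neural-queue update; returns (V_t, s_t, r_t)."""
    V_t = np.vstack([V_prev, v_t[None, :]])      # enqueue at the top (Eq. 9.71)
    s_t = np.zeros(len(s_prev) + 1)
    for i in range(len(s_prev)):                 # Eq. 9.78: pop from the bottom
        below = s_prev[:i].sum()
        s_t[i] = max(0.0, s_prev[i] - max(0.0, u_t - below))
    s_t[-1] = d_t
    r_t = np.zeros(V_t.shape[1])
    for i in range(len(s_t)):                    # Eq. 9.79: read from the bottom
        below = s_t[:i].sum()
        r_t += min(s_t[i], max(0.0, 1.0 - below)) * V_t[i]
    return V_t, s_t, r_t

m = 3
V, s = np.zeros((0, m)), np.zeros(0)
V, s, r1 = queue_step(V, s, np.eye(m)[0], d_t=1.0, u_t=0.0)   # enqueue v1
V, s, r2 = queue_step(V, s, np.eye(m)[1], d_t=1.0, u_t=0.0)   # enqueue v2
V, s, r3 = queue_step(V, s, np.eye(m)[2], d_t=1.0, u_t=1.0)   # dequeue, enqueue v3
```

The read after the second step still returns the oldest entry, and after the dequeue in the third step it moves on to the second entry.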


9.3.7 Recurrent Entity Networks 


Henaff et al. designed recurrent entity networks (EntNet), a highly parallel architecture with a long dynamic memory that performs well on many NLU tasks [Hen+16]. The idea is to have blocks of memory cells, where each cell can store information about an entity in the sentence, so that many entities corresponding to names, locations, and others have their information content in the cells. We will discuss the core components of EntNet shown in Fig. 9.16.


9.3.7.1 Input Encoder 


Let us consider specifically a question-answering system with sentences discussing the topic of interest, where the question and answer are both found in the given sentences, though this can be used for many other tasks. Let us consider a setup with the training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the input sentences, $q$ is the question, and $y_i$ is the single-word answer. The input encoding layer transforms the sequence of words into a fixed-length vector. As the authors describe, this can be done using a BOW representation or the end states of an RNN. They chose a simple representation that combines a set of vectors $\{f_1, \ldots, f_k\}$ with the input embeddings of the words $\{e_1, \ldots, e_k\}$ for a given input at time $t$:

$$s_t = \sum_i f_i \circ e_i \tag{9.80}$$

where $\circ$ is the Hadamard product or element-wise multiplication. The same set of vectors $\{f_1, \ldots, f_k\}$ is used for all the time steps. The embedding matrix $E \in \mathbb{R}^{|V| \times d}$ transforms each word in the sentence using $E(w) = e \in \mathbb{R}^d$, where $d$ is the dimension of the embeddings. The vectors $\{f_1, \ldots, f_k\}$ are learned from the training data jointly with the other parameters.


9.3.7.2 Dynamic Memory 


As shown in Fig. 9.16b, the encoded input sentences flow into blocks of memory cells, and the whole network is a form of gated recurrent unit (GRU) with hidden states in these blocks, which concatenated together give the total hidden state of the network. The number of blocks $h_1, \ldots, h_m$ is of the order of 5 to 20, and each block $h_j$ has 20 to 100 units. Each block $j$ is given a hidden state $h_j \in \mathbb{R}^d$ and a key $w_j \in \mathbb{R}^d$.

The role of a block is to capture information about an entity from the facts. This is accomplished by associating the weights of the key vectors with the embeddings of entities of interest so that the model learns information about the entities occurring within the text. A generic block $j$ with key $w_j$ and hidden state $h_j$ is updated by:

$$g_j \leftarrow \mathrm{sigmoid}\big(s_t^\top h_j + s_t^\top w_j\big) \quad \text{(gate)} \tag{9.81}$$
$$\tilde{h}_j \leftarrow \phi\big(P h_j + Q w_j + R s_t\big) \quad \text{(candidate memory)} \tag{9.82}$$
$$h_j \leftarrow h_j + g_j \circ \tilde{h}_j \quad \text{(new memory)} \tag{9.83}$$
$$h_j \leftarrow \frac{h_j}{\lVert h_j \rVert} \quad \text{(reset memory)} \tag{9.84}$$

where $g_j$ is the gate that decides how much of the memory will be updated, $\phi$ is an activation function such as ReLU, $\tilde{h}_j$ is the candidate memory that combines the older time step with the current input, and the normalization in the last step helps in forgetting the previous information. The matrices $P \in \mathbb{R}^{d \times d}$, $Q \in \mathbb{R}^{d \times d}$, $R \in \mathbb{R}^{d \times d}$ are shared across all blocks.

Fig. 9.16: EntNet. (a) A single block in the EntNet. (b) Recurrent entity networks 
(EntNet) with multiple blocks 
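
Because the matrices are shared, all blocks can be updated at once. The following NumPy sketch applies Eqs. 9.81 to 9.84 to a batch of blocks; the function name, the sizes, and the random untrained parameters are assumptions, and ReLU is used for the activation as the text suggests.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def entnet_step(h, w, s_t, P, Q, R):
    """Update all memory blocks for one encoded sentence s_t.

    h, w: (blocks, d) hidden states and keys; s_t: (d,) encoded input.
    """
    g = sigmoid(h @ s_t + w @ s_t)                          # Eq. 9.81, one gate per block
    h_cand = np.maximum(0.0, h @ P.T + w @ Q.T + R @ s_t)   # Eq. 9.82 with ReLU
    h_new = h + g[:, None] * h_cand                         # Eq. 9.83
    norms = np.linalg.norm(h_new, axis=1, keepdims=True)
    return h_new / np.maximum(norms, 1e-8)                  # Eq. 9.84

rng = np.random.default_rng(0)
blocks, d = 5, 6
h = rng.normal(size=(blocks, d))
w = rng.normal(size=(blocks, d))
P, Q, R = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
for s_t in rng.normal(size=(3, d)):       # three encoded sentences in sequence
    h = entnet_step(h, w, s_t, P, Q, R)
```

A block whose state and key are both nearly orthogonal to `s_t` gets a gate near 0.5 rather than 0 in this untrained setting; it is training that drives the gates of unrelated blocks toward 0.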


9.3.7.3 Output Module and Training 


The output module, when presented with the question $q$, creates a probability distribution over all the hidden states, and the full set of equations can be written as:

$$p_j = \mathrm{softmax}\big(q^\top h_j\big) \tag{9.85}$$
$$u = \sum_j p_j h_j \tag{9.86}$$
$$y = R\,\phi(q + H u) \tag{9.87}$$

The matrices $R \in \mathbb{R}^{|V| \times d}$ and $H \in \mathbb{R}^{d \times d}$ are again trained with the rest of the parameters. The function $\phi$ adds the non-linearity and can be an activation such as ReLU. The entire network is trained using backpropagation. The entities can be extracted as part of preprocessing, and the key vectors can be specifically tied to the embeddings of the entities existing in the stories such as {Mary, Sandra, John, bathroom, 
bedroom, kitchen, garden, football} in the bAbI example. 
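
The output module of Eqs. 9.85 to 9.87 amounts to attention over the blocks followed by a single projection to the vocabulary. The sketch below uses assumed names (`entnet_output`, `R_out` for the output matrix $R$) and random untrained parameters; the index of the largest score in `y` would be the predicted answer word.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entnet_output(q, h, R_out, H):
    p = softmax(h @ q)                        # Eq. 9.85: attention over blocks
    u = p @ h                                 # Eq. 9.86: weighted block summary
    y = R_out @ np.maximum(0.0, q + H @ u)    # Eq. 9.87 with ReLU as phi
    return p, y

rng = np.random.default_rng(0)
blocks, d, vocab_size = 5, 6, 12
h = rng.normal(size=(blocks, d))              # final hidden states of the blocks
q = rng.normal(size=d)                        # encoded question
R_out = rng.normal(scale=0.3, size=(vocab_size, d))
H = rng.normal(scale=0.3, size=(d, d))
p, y = entnet_output(q, h, R_out, H)
answer_index = int(np.argmax(y))              # index of the predicted word
```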


9.3.8 Applications of Memory Augmented Networks in Text and 
Speech 


Most memory networks have been successfully applied to complex NLU tasks such as question answering and semantic role labeling [WCB14, Suk+15, Gra+16, Hen+16]. Sukhbaatar et al. applied end-to-end memory networks to the language modeling task and outperformed traditional RNNs by increasing the number of memory hops [Suk+15]. Kumar et al. recast most NLP tasks, from syntactic to semantic, in a question-answering framework and applied dynamic memory networks successfully [Kum+16]. Grefenstette et al. showed significant performance gains using memory networks such as neural stacks, queues, and deques in transduction tasks such as inversion transduction grammars (ITG) used in machine translation [Gre+15].


9.4 Case Study 


In this section, we explore two NLP topics: attention-based NMT and memory networks for question answering. Each topic follows the same format as used in previous chapters and provides exercises at the end.


9.4.1 Attention-Based NMT 


In this portion of the case study, we compare attention mechanisms on the English-to-French translation task introduced in Chap. 7. The dataset is the same one used in the Chap. 7 case study and is composed of translation pairs from the Tatoeba website.


9.4.2 Exploratory Data Analysis


For the EDA process, we refer the readers to Sect. 7.7.2 for the steps that were used to create the dataset splits. The dataset summary is shown below.


    Training set size: 107885
    Validation set size: 13486
    Testing set size: 13486
    Size of English vocabulary: 4755
    Size of French vocabulary: 6450


9.4.2.1 Software Tools and Libraries 


When we first explored neural machine translation, we used the fairseq library which 
leverages PyTorch. To the best of our knowledge, there is not a single library that 
supports all the different attention mechanisms. Therefore, we combine a collection 
of libraries to compare the attention approaches. Specifically, we use PyTorch as 
the deep learning framework, AllenNLP for most attention mechanism implemen- 
tations, spaCy for tokenization, and torchtext for the data loader. The code contained 
here extends some of the original work in the PyTorch tutorials with additional func- 
tionality and comparisons. 


9.4.2.2 Model Training 


We compare five different attention mechanisms, training for 100 epochs. For each 
attention mechanism the model that performs the best on the validation data is cho- 
sen to run on the testing data. The models trained are 4-layer bidirectional GRU 
encoders with a single unidirectional GRU decoder. The encoder and decoder both 
have a hidden size of 512 with the encoding and decoding embeddings having a size 
of 256. The models are trained with cross-entropy loss and SGD, with a batch size 
of 512. The initial learning rate is 0.01 for the encoder and 0.05 for the decoder, and 
momentum is applied to both with a value of 0.9. A learning rate schedule is used 
to reduce the learning rate when the validation loss hasn’t improved for 5 epochs. 

To regularize our model, we add dropout to both the encoder and decoder with a 
probability of 0.1 and the norms of the gradients are clipped at 10. 
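
The learning-rate schedule and gradient clipping described above can be illustrated with a small, framework-free sketch. The class name `PlateauSchedule`, the helper `clip_by_global_norm`, and the reduction factor of 0.1 are hypothetical choices made for illustration; only the patience of 5 epochs, the initial rates of 0.01 and 0.05, and the clipping threshold of 10 come from the text. In PyTorch, the analogous built-ins are `torch.optim.lr_scheduler.ReduceLROnPlateau` and `torch.nn.utils.clip_grad_norm_`, whose exact counting of "bad" epochs may differ from this sketch.

```python
import math


class PlateauSchedule:
    """Reduce a learning rate when the validation loss stops improving."""

    def __init__(self, lr, patience=5, factor=0.1):
        self.lr = lr
        self.patience = patience  # from the text: 5 epochs without improvement
        self.factor = factor      # assumption: the text does not state the factor
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


def clip_by_global_norm(grads, max_norm=10.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


encoder_lr = PlateauSchedule(lr=0.01)
decoder_lr = PlateauSchedule(lr=0.05)
# One improving epoch followed by five epochs without improvement.
for val_loss in [3.0, 3.1, 3.2, 3.1, 3.3, 3.2]:
    encoder_lr.step(val_loss)
    decoder_lr.step(val_loss)

clipped = clip_by_global_norm([30.0, 40.0])  # the original norm is 50
```

Keeping one schedule per learning rate mirrors the separate encoder and decoder rates used in this case study.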

We incorporate a batch implementation of the model to leverage the parallel com- 
putation of the GPUs. The architecture is the same for each of the models except for 
the Bahdanau model, which required introducing a weight matrix for the bidirec- 
tional output of the encoder in the attention mechanism. 

We define the different components of our networks as follows: 

import random

import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim,
                 dropout, num_layers=1, bidirectional=False):
        super().__init__()

        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.num_layers = num_layers
        self.bidirectional = bidirectional

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, num_layers=num_layers,
                          bidirectional=bidirectional)
        self.dropout = nn.Dropout(dropout)
        if bidirectional:
            # Projects the concatenated forward/backward states to the decoder size
            self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)

        if self.bidirectional:
            hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :],
                                                   hidden[-1, :, :]), dim=1)))

        if not self.bidirectional and self.num_layers > 1:
            hidden = hidden[-1, :, :]

        return outputs, hidden


class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout,
                 attention, bidirectional_input=False):
        super().__init__()

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.attention = attention
        self.bidirectional_input = bidirectional_input

        self.embedding = nn.Embedding(output_dim, emb_dim)

        if bidirectional_input:
            self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
            self.out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim,
                                 output_dim)
        else:
            self.rnn = nn.GRU(enc_hid_dim + emb_dim, dec_hid_dim)
            self.out = nn.Linear(enc_hid_dim + dec_hid_dim + emb_dim,
                                 output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs):
        input = input.unsqueeze(0)
        embedded = self.dropout(self.embedding(input))
        hidden = hidden.squeeze(0) if len(hidden.size()) > 2 else hidden  # batch_size=1 issue

        # Repeat hidden state for attention on bidirectional outputs
        if hidden.size(-1) != encoder_outputs.size(-1):
            attn = self.attention(hidden.repeat(1, 2),
                                  encoder_outputs.permute(1, 0, 2))
        else:
            attn = self.attention(hidden, encoder_outputs.permute(1, 0, 2))

        a = attn.unsqueeze(1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        weighted = torch.bmm(a, encoder_outputs)
        weighted = weighted.permute(1, 0, 2)

        rnn_input = torch.cat((embedded, weighted), dim=2)
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)

        output = self.out(torch.cat((output, weighted, embedded), dim=1))

        return output, hidden.squeeze(0), attn


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        encoder_outputs, hidden = self.encoder(src)
        hidden = hidden.squeeze(1)

        output = trg[0, :]  # first input to decoder <sos>
        for t in range(1, max_len):
            output, hidden, attn = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            # With probability teacher_forcing_ratio, feed the ground-truth token
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = trg[t] if teacher_force else top1

        return outputs

We use the attention implementations from AllenNLP for dot product, cosine, 
and bilinear attention. These functions take the hidden state of the decoder and the 
output of the encoder and return the attended scores. 


from allennlp.modules.attention import (LinearAttention,
                                        CosineAttention,
                                        BilinearAttention,
                                        DotProductAttention)

attn = DotProductAttention()  # Changed for each type of model
enc = Encoder(INPUT_DIM,
              ENC_EMB_DIM,
              ENC_HID_DIM,
              DEC_HID_DIM,
              ENC_DROPOUT,
              num_layers=ENC_NUM_LAYERS,
              bidirectional=ENC_BIDIRECTIONAL)
dec = Decoder(OUTPUT_DIM,
              DEC_EMB_DIM,
              ENC_HID_DIM,
              DEC_HID_DIM,
              DEC_DROPOUT,
              attn,
              bidirectional_input=ENC_BIDIRECTIONAL)

model = Seq2Seq(enc, dec, device).to(device)
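
To make the role of the scoring functions concrete, the following framework-free sketch computes dot-product and cosine attention weights for a single decoder state and then forms the weighted context vector. The function names and toy vectors are hypothetical, the sketch assumes the scores are normalized with a softmax, and it is not the AllenNLP implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dot_product_attention(hidden, encoder_outputs):
    """Score each encoder output by its dot product with the decoder state."""
    return softmax([dot(hidden, h) for h in encoder_outputs])


def cosine_attention(hidden, encoder_outputs):
    """Score each encoder output by its cosine similarity with the decoder state."""
    def cos(u, v):
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + 1e-13)
    return softmax([cos(hidden, h) for h in encoder_outputs])


def context_vector(weights, encoder_outputs):
    """Attention-weighted sum of the encoder outputs."""
    dim = len(encoder_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, encoder_outputs))
            for i in range(dim)]


hidden = [1.0, 0.0]
encoder_outputs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
dot_weights = dot_product_attention(hidden, encoder_outputs)
cos_weights = cosine_attention(hidden, encoder_outputs)
context = context_vector(dot_weights, encoder_outputs)
```

Cosine scores are bounded in [-1, 1], so the softmax over them is flatter than the softmax over unnormalized dot products; in this toy example both variants put the most weight on the first encoder output, but the dot-product weights are more sharply peaked.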


In Figs. 9.17 and 9.18 we show the training graphs for the loss and PPL, respec- 
tively, for each of the attention models. The three methods that perform the best are 
Bahdanau, dot product, and bilinear models. Cosine and linear attention struggle to 
converge. The attention mechanism in linear attention specifically does not correlate 
with the input sequence at all. 

In Figs. 9.19, 9.20, 9.21, 9.22, and 9.23, we give some examples of the decoded 
attention outputs for three different files, showing what the decoder is attending to 
during the translation process. In each of the figures, the first two graphs (a) and 
(b) are inputs with lengths of 10, the maximum seen by the models during train- 
ing. In most cases, the attention still aligns with the input; however, the predictions 
are mostly incorrect, typically with high entropy near the time steps close to the 
maximum training sequence length. 

[Figure: panels (a) "Attention Comparison: Training Loss" and (b) "Attention Comparison: Validation Loss", with one curve each for the Bahdanau, bilinear, cosine, dot product, and linear attention models] 

Fig. 9.17: (a) Training and (b) validation losses for each attention model 


9.4.2.3 Bahdanau Attention 


The Bahdanau attention employs a fully connected layer to combine the concatenated 
outputs of the bidirectional layer, rather than duplicating the hidden state. 
Incorporating this requires slight alterations to accommodate the changes in tensor 
sizes. 

class BahdanauEncoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()

        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim

        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, num_layers=4,
                          bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.rnn(embedded)
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :],
                                               hidden[-1, :, :]), dim=1)))
        return outputs, hidden

Minor alterations are made to the decoder to handle the difference between the 
hidden size and the encoder output. 

import torch.nn.functional as F


class BahdanauAttention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()

        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim

        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Parameter(torch.rand(dec_hid_dim))

    def forward(self, hidden, encoder_outputs):
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]

        # Repeat the decoder state once per source position
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs),
                                                dim=2)))
        energy = energy.permute(0, 2, 1)

        v = self.v.repeat(batch_size, 1).unsqueeze(1)

        attention = torch.bmm(v, energy).squeeze(1)
        return F.softmax(attention, dim=1)

Fig. 9.18: (a) Training and (b) validation PPL for each attention model 

Fig. 9.19: Dot product attention examples 

Fig. 9.20: Cosine attention examples 

Fig. 9.22: Linear attention examples. Note how the model was unable to learn a 
useful mapping from the attention mechanism while still being able to translate 
some examples 

Fig. 9.23: Bahdanau attention examples 


9.4.2.4 Results 


The best performing model for each of the attention mechanisms was run on the 
testing set to produce the following results in Table 9.2. 









Attention type   Loss     PPL 
Dot              17.826   2.881 
Bilinear         13.987   2.638 
Cosine           22.098   3.095 
Linear           17.918   2.886 
Bahdanau         17.580   2.867 


Table 9.2: Test results for attention models. Best results are shown in bold 


Bilinear attention performed the best in this experiment. We can see from the 
attention alignments in Fig. 9.21 that the attention output is strongly correlated with 
the input. Moreover, the attention remains highly confident throughout the predicted 
sequence, losing confidence only slightly towards the end. 


9.4.3 Question and Answering 


To help the reader familiarize themselves with the attention and memory networks, 
we will apply the concepts of this chapter to the question-answering task with the 
bAbI dataset. bAbI is a collection of 20 simple QA tasks with limited vocabulary. 
For each task, there is a set of 1000 training and 1000 test stories, questions, 
and answers, as well as an extended training set with 10,000 samples. Despite its 
simplicity, bAbI effectively captures the complexities of memory and long-range 
dependencies in question answering. For this case study, we will focus on tasks 1-3, 
consisting of questions where up to three supporting facts from the stories provide 
information to support the answer. 


9.4.3.1 Software Tools and Libraries 


We will implement several architectures with Keras and TensorFlow for this case 
study. Keras provides a useful example recurrent neural network architecture for 
the question-answering task that will serve as our baseline. We will contrast per- 
formance with several of the memory network-based architectures discussed in this 
chapter, including a differentiable neural computer model from DeepMind. Rather 
than providing full coverage of each architecture here, we direct the reader to the 
notebooks accompanying this chapter for full implementation details. 


9.4.3.2 Exploratory Data Analysis 


Our first step is to download the bAbI dataset and to extract the training and test sets 
for our analysis. We will focus on the extended dataset with 10,000 training samples 
and 1000 test samples. Let’s first take a quick look at the samples for tasks QA1, 
QA2, and QA3: 


QA1 Story:  Mary moved to the bathroom. John went to the hallway. 
QA1 Query:  Where is Mary? 
QA1 Answer: bathroom 


QA2 Story:  Mary moved to the bathroom. Sandra journeyed to the bedroom. Mary got 
            the football there. John went to the kitchen. Mary went back to the kitchen. 
            Mary went back to the garden. 
QA2 Query:  Where is the football? 
QA2 Answer: garden 


QA3 Story:  Mary moved to the bathroom. Sandra journeyed to the bedroom. Mary got 
            the football there. John went back to the bedroom. Mary journeyed to the 
            office. John journeyed to the office. John took the milk. Daniel went back 
            to the kitchen. John moved to the bedroom. Daniel went back to the hallway. 
            Daniel took the apple. John left the milk there. John travelled to the 
            kitchen. Sandra went back to the bathroom. Daniel journeyed to the bathroom. 
            John journeyed to the bathroom. Mary journeyed to the bathroom. 
            Sandra went back to the garden. Sandra went to the office. Daniel went to 
            the garden. Sandra went back to the hallway. Daniel journeyed to the office. 
            Mary dropped the football. John moved to the bedroom. 
QA3 Query:  Where was the football before the bathroom? 
QA3 Answer: office 
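
Samples like those above can be extracted from the raw task files with a few lines of code. The sketch below assumes the plain-text layout of the public bAbI release, in which every line starts with a sentence index, the index restarts at 1 for each new story, and question lines carry a tab-separated answer and supporting-fact indices. The function name `parse_babi` and the in-memory `raw` example are hypothetical.

```python
def parse_babi(lines):
    """Return a list of (story_sentences, question, answer) triples."""
    samples, story = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        idx, text = line.split(" ", 1)
        if int(idx) == 1:
            story = []  # a sentence index of 1 starts a new story
        if "\t" in text:
            # Question line: question, answer, supporting-fact indices
            question, answer, _supporting = text.split("\t")
            samples.append((list(story), question.strip(), answer))
        else:
            story.append(text)
    return samples


raw = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary? \tbathroom\t1",
]
samples = parse_babi(raw)
```

Each question is paired with a copy of the story seen so far, so a story containing several questions yields several samples of increasing length.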


Analysis of the datasets shows the increasing complexity and long-range memory 
that is required when progressing from task QA1 to QA3. The distribution of story 
lengths and question lengths (in terms of the number of tokens) can be seen in 
Fig. 9.24. 


Task  train_stories  test_stories  min(story_size)  max(story_size)  query_size  vocab_size 
QA1   10,000         1000          12               68               4           (illegible) 
QA2   10,000         1000          12               (illegible)      5           (illegible) 
QA3   10,000         1000          22               1875             8           36 

The average length of the stories increases substantially from tasks QA1 to QA3, 
which makes the task significantly more difficult. Remember that for task QA3, there 
are only three supporting facts and most of the story is considered “noise.” We will 
see how well different architectures are able to learn to identify the relevant facts 
from this noise. 


9.4.3.3 LSTM Baseline 


We use the Keras LSTM architecture example to serve as our baseline. This archi- 
tecture consists of the following: 


1. The tokens of each story and question are mapped to embeddings (that are not 
shared between them). 

2. The stories and questions are encoded using separate LSTMs. 

3. The encoded vectors for the story and question are concatenated. 

4. These concatenated vectors are used as an input to a DNN whose output is a 
softmax over the vocabulary. 

5. The entire network is trained to minimize the error between the softmax output 
and the answer. 


The Keras model that implements this architecture is: 





# Imports as in the Keras 2.x bAbI RNN example
from keras import layers
from keras.layers import recurrent
from keras.models import Model

RNN = recurrent.LSTM

# Story branch: embed the tokens and encode them with an LSTM
sentence = layers.Input(shape=(story_maxlen,), dtype='int32')
encoded_sentence = layers.Embedding(vocab_size, EMBED_HIDDEN_SIZE)(sentence)
encoded_sentence = layers.Dropout(0.3)(encoded_sentence)
encoded_sentence = RNN(SENT_HIDDEN_SIZE,
                       return_sequences=False)(encoded_sentence)

# Question branch: separate embedding and LSTM
question = layers.Input(shape=(query_maxlen,), dtype='int32')
encoded_question = layers.Embedding(vocab_size, EMBED_HIDDEN_SIZE)(question)
encoded_question = layers.Dropout(0.3)(encoded_question)
encoded_question = RNN(QUERY_HIDDEN_SIZE,
                       return_sequences=False)(encoded_question)

# Concatenate both encodings and predict a word from the vocabulary
merged = layers.concatenate([encoded_sentence, encoded_question])
merged = layers.Dropout(0.3)(merged)
preds = layers.Dense(vocab_size, activation='softmax')(merged)

model = Model([sentence, question], preds)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

Fig. 9.24: Distributions of story and question lengths in bAbI tasks 1-3 





We train this model using the extended bAbI training sets with 50-dim embeddings, 
100-dim encodings, batch size of 32, and the Adam optimizer for 100 epochs. The 
performance on tasks QA1, QA2, and QA3 is given in Table 9.3. As seen in the 
results, the longer the stories, the worse the performance of the LSTM model due to 
the increased “noise” in the data. 


Table 9.3: Baseline LSTM performance 

Task  Test set accuracy 
QA1   0.51 
QA2   0.31 
QA3   0.17 


9.4.3.4 End-to-End Memory Network 


Memory networks offer the opportunity to store long-term information and thereby 
improve performance, especially on longer sequences such as task QA3. Memory 
networks are able to store supporting facts as memory vectors which are queried 
and used for prediction. In the original form by Weston, the memory vectors are 
learned via direct supervision with hard attention and supervision is required at each 
layer of the network. This requires significant effort. To overcome this need, end- 
to-end memory networks as proposed by Sukhbaatar use soft attention in place of 
supervision that can be learned during training via backpropagation. This end-to-end 
architecture takes the following steps: 


1. Each story sentence and query are mapped to separate embedding representa- 
tions. 

2. The query embedding is compared with the embedding of each sentence in the 
memory, and a softmax function is used to generate a probability distribution 
analogous to a soft attention mechanism. 

3. These probabilities are used to select the most relevant sentence in memory using 
a separate set of sentence embeddings. 

4. The resulting vector is concatenated with the query embedding and used as input 
to an LSTM layer followed by a dense layer with a softmax output. 

5. The entire network is trained to minimize the error between the softmax output 
and the answer. 
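
The memory lookup in steps 2 and 3 can be sketched without any framework. The sketch below assumes each sentence has already been reduced to one input-memory vector and one output-memory vector (for example, by summing word embeddings); the function name `memory_hop` and the toy vectors are hypothetical, and this is an illustration rather than the Keras implementation used in the case study.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_hop(query, input_memory, output_memory):
    """One soft-attention read: weights from input_memory, response from output_memory."""
    scores = [sum(q * m for q, m in zip(query, row)) for row in input_memory]
    weights = softmax(scores)
    dim = len(output_memory[0])
    response = [sum(w * row[i] for w, row in zip(weights, output_memory))
                for i in range(dim)]
    return weights, response


query = [1.0, 0.0, 1.0]
input_memory = [[1.0, 0.0, 1.0],   # sentence closely matching the query
                [0.0, 1.0, 0.0],   # unrelated sentence
                [0.5, 0.0, 0.0]]   # weakly related sentence
output_memory = [[1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 0.0]]
weights, response = memory_hop(query, input_memory, output_memory)
```

Because the selection is a softmax-weighted sum rather than a hard choice, the whole lookup stays differentiable. In the original MemN2N formulation, adding the response to the query and repeating the lookup gives a further hop.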


Note that this is termed a 1-hop or single-layered MemN2N, since we query the 
memory only once. As described earlier, memory layers can be stacked to improve 
performance, especially where multiple facts are relevant and necessary to predict 
the answer. The Keras implementation of the architecture is given below. 





from keras.layers import (Activation, Dense, Dropout, Embedding, Input, LSTM,
                          Permute, add, concatenate, dot)
from keras.models import Model

# Input memory representation of the story
input_sequence = Input((story_maxlen,))
input_encoded_m = Embedding(input_dim=vocab_size,
                            output_dim=EMBED_HIDDEN_SIZE)(input_sequence)
input_encoded_m = Dropout(0.3)(input_encoded_m)

# Output memory representation of the story
input_encoded_c = Embedding(input_dim=vocab_size,
                            output_dim=query_maxlen)(input_sequence)
input_encoded_c = Dropout(0.3)(input_encoded_c)

question = Input((query_maxlen,))
question_encoded = Embedding(input_dim=vocab_size,
                             output_dim=EMBED_HIDDEN_SIZE,
                             input_length=query_maxlen)(question)
question_encoded = Dropout(0.3)(question_encoded)

# Soft attention between the story memory and the question
match = dot([input_encoded_m, question_encoded], axes=(2, 2))
match = Activation('softmax')(match)

response = add([match, input_encoded_c])
response = Permute((2, 1))(response)

answer = concatenate([response, question_encoded])
answer = LSTM(BATCH_SIZE)(answer)
answer = Dropout(0.3)(answer)
answer = Dense(vocab_size)(answer)
answer = Activation('softmax')(answer)

model = Model([input_sequence, question], answer)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])


We train this single-layered model using the extended bAbI training sets with 50-dim 
embeddings, batch size of 32, and the Adam optimizer for 100 epochs. The 
performance on tasks QA1, QA2, and QA3 is given in Table 9.4. In comparison to 
the baseline LSTM, the MemN2N model did better on all three tasks after 100 epochs, 
most notably on QA1 (0.92 versus 0.51). 


Table 9.4: End-to-end memory network performance 

Task  Accuracy (20 epochs)  Accuracy (100 epochs) 
QA1   0.53                  0.92 
QA2   0.39                  0.35 
QA3   0.15                  0.21 


9.4.4 Dynamic Memory Network 


As discussed earlier, dynamic memory networks take memory networks one step 
further and encode memories using a GRU layer. An episodic memory layer is the 
key to dynamic memory networks, with its attention mechanisms for feature gener- 
ation and scoring. Episodic memory is composed of two nested GRUs, where the 
inner GRU generates the episodes and the outer GRU generates the memory vector 
from the sequence of episodes. A DMN proceeds through the following steps: 


1. The input story sentences and query are encoded using GRUs and passed to the 
episodic memory module. 

2. Episodes are generated by attending over these encodings to form a memory 
such that sentence encodings with low attention scores are ignored. 

3. Episodes along with previous memory states are used to update the episodic 
memory. 

4. The query and memory states serve as inputs to the GRU within the answer 
module which is used to predict the output. 

5. The entire network is trained to minimize the error between the GRU output and 
answer. 


A TensorFlow implementation of the episodic memory module for a dynamic mem- 
ory network is provided below. Note that EpisodicMemoryModule depends on a 
soft attention GRU implementation, which is included in the case study code. 





class EpisodicMemoryModule(Layer):
    # Excerpt: method definitions are omitted; see the case study code.

    # attention network
    self.l_1 = Dense(units=emb_dim, batch_size=batch_size,
                     activation='tanh')
    self.l_2 = Dense(units=1, batch_size=batch_size,
                     activation=None)

    # Episode network
    self.episode_GRU = SoftAttnGRU(units=units,
                                   return_sequences=False,
                                   batch_size=batch_size)

    # Memory generating network
    self.memory_net = Dense(units=units, activation='relu')

    for step in range(self.memory_steps):
        attentions = [tf.squeeze(
            compute_attention(fact, question, memory), axis=1)
            for i, fact in enumerate(fact_list)]
        attentions = tf.stack(attentions)
        attentions = tf.transpose(attentions)
        attentions = tf.nn.softmax(attentions)
        attentions = tf.expand_dims(attentions, axis=-1)

        # The axis value is lost in the source; concatenation along the
        # last axis is assumed here.
        episode = K.concatenate([facts, attentions], axis=-1)
        episode = self.episode_GRU(episode)

        memory = self.memory_net(K.concatenate([memory, episode, question],
                                               axis=1))

    return K.concatenate([memory, question], axis=1)





We train a DMN model using the extended bAbI training sets with 50-dim GloVe 
embeddings, batch size of 50, 100 hidden units, 3 memory steps, and the Adam 
optimizer for just 20 epochs. The performance on tasks QA1, QA2, and QA3 is 
given in Table 9.5. In comparison with earlier architectures, we can see that dynamic 
memory networks perform better than MemN2N and LSTM networks for all three 
tasks, reaching perfect prediction on task QA1. 


Table 9.5: Dynamic memory network performance 

Task  Test set accuracy 
QA1   1.00 
QA2   0.47 
QA3   0.29 


9.4.4.1 Differentiable Neural Computer 


The differentiable neural computer (DNC) is a neural network with an independent 
memory bank. It is an embedded neural network controller with a collection of pre- 
set operations for memory storage and management. As an extension of the neural 
Turing machine architecture, it allows for scaling of memory without having to scale 
the rest of the network. 

The heart of a DNC is a neural network called a controller, which is analogous 
to a CPU in a computer. This DNC controller can perform several operations on 
memory concurrently, including reading and writing to multiple memory locations 
at once and producing output predictions. As before, the memory is a set of loca- 
tions that can each store a vector of information. The DNC controller can use soft 
attention to search memory based on the content of each location, or associative 
temporal links can be traversed forward or backward to recall sequence information 
in either direction. Queried information can then be used for prediction. 

For a given input at each time step, the DNC controller outputs four vectors: 


read vector(s): used by the read head(s) to address memory locations 
erase vector(s): used to selectively erase items from memory 
write vector(s): used by the write head(s) to store information in memory 
output vector: used as a feature for output prediction 
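
How the erase and write vectors act on memory can be sketched in a few lines. The sketch follows the erase-then-add update used by the neural Turing machine and the DNC, in which each memory cell is first scaled by one minus the product of its write weighting and erase value and the weighted write vector is then added. The function names and toy values are hypothetical, and the weightings are supplied by hand rather than produced by a controller.

```python
def write_memory(memory, write_weights, erase_vector, write_vector):
    """Erase-then-add update: M[i][j] * (1 - w[i] * e[j]) + w[i] * v[j]."""
    return [[cell * (1.0 - w * e) + w * v
             for cell, e, v in zip(row, erase_vector, write_vector)]
            for row, w in zip(memory, write_weights)]


def read_memory(memory, read_weights):
    """Read vector: weighted sum of the memory rows."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(read_weights, memory))
            for j in range(width)]


memory = [[1.0, 1.0], [2.0, 2.0]]
# Fully address location 0, erase its first component, and write a new vector.
memory = write_memory(memory,
                      write_weights=[1.0, 0.0],
                      erase_vector=[1.0, 0.0],
                      write_vector=[5.0, 0.5])
read = read_memory(memory, read_weights=[1.0, 0.0])
```

Locations with a write weighting of zero are left untouched, which is what allows the memory to grow without changing the rest of the network.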

For this case study, we will apply the TensorFlow-DNC implementation developed 
by DeepMind to the bAbI extended datasets. The DNC module for this implementation 
is given by: 





import collections

import sonnet as snt
import tensorflow as tf
from dnc import access

DNCState = collections.namedtuple('DNCState', ('access_output',
                                               'access_state',
                                               'controller_state'))


class DNC(snt.RNNCore):
    # Excerpt: method definitions are omitted; see the case study code.

    # modules
    self._controller = snt.LSTM(**controller_config)
    self._access = access.MemoryAccess(**access_config)

    # output
    prev_access_output = prev_state.access_output
    prev_access_state = prev_state.access_state
    prev_controller_state = prev_state.controller_state

    batch_flatten = snt.BatchFlatten()
    controller_input = tf.concat([batch_flatten(inputs),
                                  batch_flatten(prev_access_output)], 1)

    controller_output, controller_state = self._controller(
        controller_input, prev_controller_state)

    access_output, access_state = self._access(
        controller_output, prev_access_state)

    output = tf.concat([controller_output,
                        batch_flatten(access_output)], 1)
    output = snt.Linear(output_size=self._output_size.as_list()[0],
                        name='output_linear')(output)





We train a DNC model using the extended bAbI training sets with 50-dim GloVe 
embeddings, hidden size of 256, memory size of 256 x 64, 4 read heads, 1 write 
head, batch size of 1, and the RMSprop optimizer with gradient clipping for 20,000 
iterations. The performance on tasks QA1, QA2, and QA3 is given in Table 9.6. It 
may not be surprising that the DNC model matches the DMN on QA1 and outperforms 
all previous models on QA2 and QA3, given the increased complexity. The trade-off 
between accuracy and training time should be carefully weighed when choosing which 
architecture is most suitable for the task. For simple tasks, a single LSTM 
implementation may be all that is required. DNCs with their scalable memory are a 
better choice when complex knowledge is required for task prediction. 


Table 9.6: Differentiable neural computer performance 

Task  Test set accuracy 
QA1   1.00 
QA2   0.67 
QA3   0.55 


9.4.4.2 Recurrent Entity Network 


Recurrent entity networks (EntNets) incorporate a fixed bank of dynamic memory 
cells that allow simultaneous location and content-based updates. Because of this 
ability, they perform very well and set the state-of-the-art in reasoning tasks such as 
bAbI. Unlike the DNC which relies on a sophisticated central controller, EntNet is 
essentially a set of separate, parallel recurrent memories with independent gates for 
each memory. 

The EntNet architecture consists of an input encoder, a dynamic memory, and an 
output layer. It operates with the following steps: 


1. The input story sentences and query are mapped to embedding representations 
and passed to the dynamic memory layer and output layer, respectively. 

2. Key vectors with the embeddings of entities are generated. 

3. The hidden states (memories) of the set of gated GRU blocks within the dynamic 
memory are updated over the input encoder vectors and key vectors. 

4. The output layer applies a softmax over the query q and hidden states of the 
memory cells to generate a probability distribution over the potential answers. 

5. The entire network is trained to minimize the error between the output layer 
candidate and answer. 


The architecture of the dynamic memory cell written in TensorFlow is provided 
below: 





class DynamicMemoryCell(tf.contrib.rnn.RNNCell):
    def get_gate(self, state_j, key_j, inputs):
        # Gate opens when the input matches the memory content or its key
        a = tf.reduce_sum(inputs * state_j, axis=1)
        b = tf.reduce_sum(inputs * key_j, axis=1)
        return tf.sigmoid(a + b)

    def get_candidate(self, state_j, key_j, inputs, U, V, W, U_bias):
        key_V = tf.matmul(key_j, V)
        state_U = tf.matmul(state_j, U) + U_bias
        inputs_W = tf.matmul(inputs, W)
        return self._activation(state_U + inputs_W + key_V)

    def __call__(self, inputs, state):
        # U, V, W, and U_bias are the weights shared by all blocks; their
        # creation is omitted in this excerpt.
        state = tf.split(state, self._num_blocks, axis=1)
        next_states = []
        for j, state_j in enumerate(state):
            key_j = tf.expand_dims(self._keys[j], axis=0)
            gate_j = self.get_gate(state_j, key_j, inputs)
            candidate_j = self.get_candidate(state_j,
                                             key_j,
                                             inputs,
                                             U, V, W, U_bias)
            state_j_next = state_j + tf.expand_dims(gate_j, -1) * candidate_j
            # Normalize each memory, guarding against division by zero
            state_j_next_norm = tf.norm(tensor=state_j_next,
                                        ord='euclidean',
                                        axis=-1,
                                        keep_dims=True)
            state_j_next_norm = tf.where(tf.greater(state_j_next_norm, 0.0),
                                         state_j_next_norm,
                                         tf.ones_like(state_j_next_norm))
            state_j_next = state_j_next / state_j_next_norm
            next_states.append(state_j_next)
        state_next = tf.concat(next_states, axis=1)
        return state_next, state_next





We train an EntNet using the extended bAbI training set with 100-dim embeddings, 
20 blocks, batch size of 32, and the Adam optimizer with gradient clipping for 
200 epochs. The performance on tasks QA1, QA2, and QA3 is given in Table 9.7. 
Our implementation matches the best previous result on QA1 and exceeds all previous 
architectures on QA2 and QA3. Note that with proper hyperparameter tuning, the 
performance of EntNet and the previous architectures can be improved on the bAbI 
tasks. 


Table 9.7: EntNet performance 

Task  Test set accuracy 
QA1   1.00 
QA2   0.97 
QA3   0.90 


9.4.5 Exercises for Readers and Practitioners 


The readers and practitioners can consider extending the case study to the following 
problems in order to expand their knowledge: 


1. Memory and complexity can be limited when using the same embedding matrix 
for both the encoder and decoder. What would need to change to address this 
problem? 

2. Tune and increase the number of epochs for the baseline LSTM model during 
training. Does adding dropout help? 

3. Add a second and third hop to the end-to-end memory network and see if performance 
improves on bAbI tasks QA2 and QA3. 

4. How does restricting the size of the memory representation affect performance? 

5. Is there a significant effect from using a different similarity scoring function instead 
of the softmax within the memory controller of a MemN2N network? 

6. Explore the architectures in this case study on bAbI tasks 3-20. Does the simple 
baseline LSTM outperform on certain tasks? 


Chapter 10 
Transfer Learning: Scenarios, Self-Taught Learning, and Multitask Learning 





10.1 Introduction 


Most supervised machine learning techniques, such as classification, rely on some 
underlying assumptions, such as: (a) the data distributions during training and pre- 
diction time are similar; (b) the label space during training and prediction time are 
similar; and (c) the feature space between the training and prediction time remains 
the same. In many real-world scenarios, these assumptions do not hold due to the 
changing nature of the data. 

There are many techniques in machine learning to address these problems, 
such as incremental learning, continuous learning, cost-sensitive learning, semi- 
supervised learning, and more. In this chapter, we will focus mainly on transfer 
learning and related techniques to address these issues. 

DARPA defines transfer learning as the ability of the system to learn and apply 
knowledge from previous tasks to new tasks [Dar05]. Over the following 7 to 10 
years, this research gave rise to many successes, using mostly traditional machine 
learning algorithms with transfer learning as the focus. It impacted various domains, 
such as wireless telecommunications, computer vision, text mining, and many others 
[Fun+06, DM06, Dai+07b, Dai+07a, TS07, Rai+07, JZ07, BBS07, Pan+08, WSZ08]. 

As the deep learning field is evolving rapidly, the main focus these days is on
unsupervised and transfer learning. We can classify transfer learning into various
sub-fields, such as self-taught learning, multitask learning, domain adaptation,
zero-shot learning, one-shot learning, few-shot learning, and more. In this chapter, we
will first go over the definitions and fundamental scenarios of transfer learning. We
will then cover the techniques involved in self-taught learning and multitask learning.
Finally, we will carry out a detailed case study on multitask learning using NLP
tasks to get hands-on experience with the concepts and methods covered in this
chapter.
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10.2 Transfer Learning: Definition, Scenarios, and 
Categorization 


As shown in Fig. 10.1, in traditional machine learning, different models need to be
learned for different sources (data and labels). For each source (task or domain)
with training data and labels, the system learns a model (model A or model B)
that is only effective on targets (task or domain) that are similar to the source
it was trained on. In most cases, the model learned for a specific source cannot
be used for predicting on a target that is different. If a model requires a large
amount of training data, then the effort of collecting data, labeling the data,
training the models, and validating the models has to be repeated per source.
From a cost and resource perspective, this effort becomes unwieldy as the number
of systems grows.

Figure 10.2 shows a general transfer learning system, which extracts knowledge
from the source system or model and transfers it in some way so that it
is useful on a target. Here, model A, trained for a task using training data from
source A, is used to extract knowledge that is transferred to another target task.




Fig. 10.1: Traditional machine learning system on two different sources and target 
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Fig. 10.2: Transfer learning system on different source and target 


10.2.1 Definition 


In order to define transfer learning precisely, we will first define a couple of concepts
as given by Pan and Yang, i.e., domains and tasks [PY10]. A domain $\mathcal{D} = (\mathcal{X}, P(X))$
is defined in terms of (a) the feature space $\mathcal{X}$ and (b) the marginal probability
distribution $P(X)$, where $X$ represents the training data samples $X = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}$.
For example, in the task of sentiment analysis with binary classification, $X$
corresponds to a bag-of-words representation and $x_i$ corresponds to the $i$th term in the
corpus. Thus, when either the feature spaces or the marginal probability distributions
are different for two systems, we say that the domains do not match.

A task $\mathcal{T} = (\mathcal{Y}, f(\cdot))$ is defined in terms of (a) a label space $\mathcal{Y}$ and (b) an objective
prediction function $f(\cdot)$ that is not directly observed but learned from the input and
label pairs $(x_i, y_i)$. The label space consists of the set of all actual labels, for example,
true and false for binary classification. The objective prediction function $f(\cdot)$ is
used to predict the label given the data and can be interpreted probabilistically
as $f(\cdot) \approx p(y|x)$.

Given a source domain $\mathcal{D}_S$, source task $\mathcal{T}_S$, target domain $\mathcal{D}_T$, and target task
$\mathcal{T}_T$, transfer learning can be defined as the process of learning the target predictive
function $f_T(\cdot) = P(Y_T|X_T)$ in the target domain $\mathcal{D}_T$ using the knowledge from the
source domain $\mathcal{D}_S$ and the source task $\mathcal{T}_S$, such that $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
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10.2.2 Transfer Learning Scenarios 


Based on the different components of domain and task for both source and target, 
there are four different transfer learning scenarios that are listed below: 


1. Feature spaces are different, $\mathcal{X}_S \neq \mathcal{X}_T$. An example of this for sentiment
classification would be that the features are defined for two different languages. In
NLP, this is often referred to as cross-lingual adaptation.

2. Marginal probability distributions between source and target are different,
$P(X_S) \neq P(X_T)$, for example, a chat text with short forms and an email text
with formal language, both discussing sentiments.

3. Label spaces between source and target are different, $\mathcal{Y}_S \neq \mathcal{Y}_T$. This really
means that the source and target tasks are completely different, for example,
one can have labels corresponding to sentiments (positive, neutral, negative),
and the other labels corresponding to emotions (angry, sad, happy).

4. The predictive functions or conditional probability distributions are different,
$P(Y_S|X_S) \neq P(Y_T|X_T)$. An example of this is when the distribution in one is
balanced and in the other is completely skewed or highly imbalanced: the source
has equal numbers of positive and negative sentiments, but the target has very few
positives as compared to negatives.
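
The four scenarios can be probed on toy data with a few set and distribution comparisons. The sketch below is only an illustration under simplifying assumptions: documents are whitespace-tokenized bags of words, the vocabulary stands in for the feature space, total variation distance with an arbitrary tolerance `tol` stands in for "different distributions," and label balance is used as a crude proxy for the fourth scenario. The function names are hypothetical and not part of any library.

```python
from collections import Counter


def normalize(counter):
    """Turn raw counts into a probability distribution (dict)."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def transfer_scenarios(src_docs, src_labels, tgt_docs, tgt_labels, tol=0.1):
    """Report which of the four transfer learning scenarios appear to hold."""
    src_tokens = Counter(tok for doc in src_docs for tok in doc.split())
    tgt_tokens = Counter(tok for doc in tgt_docs for tok in doc.split())
    src_prior = normalize(Counter(src_labels))
    tgt_prior = normalize(Counter(tgt_labels))
    return {
        # Scenario 1: the vocabularies (feature spaces) do not match.
        "feature_space_differs": set(src_tokens) != set(tgt_tokens),
        # Scenario 2: the token (marginal) distributions differ.
        "marginal_differs": total_variation(
            normalize(src_tokens), normalize(tgt_tokens)) > tol,
        # Scenario 3: the label sets do not match.
        "label_space_differs": set(src_labels) != set(tgt_labels),
        # Scenario 4 (proxy): the class balance differs.
        "label_balance_differs": total_variation(src_prior, tgt_prior) > tol,
    }
```

In practice, a proper two-sample test or a domain classifier would be a more reliable way to detect distribution shift than a fixed tolerance on word frequencies.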


10.2.3 Transfer Learning Categories 


Based on “how to transfer” and “what to transfer” between the source and target,
transfer learning can be further categorized into many different types, many of which
have become independent fields of research and application. In this section,
we will not cover the many traditional machine learning based categorizations already
defined in the survey by Pan and Yang [PY10]. Instead, we will cover only those
categories that have been explored or made an impact in the deep learning area.

Based on the label availability and task similarities between the source and target, 
there can be various sub-categories of transfer learning, as shown in Fig. 10.3. 

When the source labels are unavailable, but a large volume of source data exists
and anywhere from a few to many target examples exist, the category of learning is
known as self-taught learning. This technique has been very successful in many
real-world applications in speech and text, where the cost or effort of labeling poses
constraints and a large volume of unlabeled data can be used to learn features that
transfer to specific tasks with labels. The core assumption made in these learning
systems is that some form of unsupervised learning on the source captures features
that help transfer knowledge to the target.

When the goal is not only to do well on the target task but to learn jointly
and do well on both source and target, where the tasks are slightly different, the
form of transfer learning is called multitask learning. The core assumption made is
that sharing information among related tasks, which should have some similarities,
improves the overall generalization.

Related to multitask learning, where the tasks between source and target differ,
domain adaptation is a form of learning where the domains (i.e., either the feature
space or the marginal distribution of the data) differ between the source and the target.
The core principle is to learn a domain-invariant representation from the source that
can be transferred effectively to the target with a different domain.

In domain adaptation, the domains differ and a small to large amount of labeled data
is available in the source. Domain adaptation can be zero-shot, one-shot, or few-shot
learning, based on the number of labeled examples available (0, 1, n).
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Fig. 10.3: Transfer learning categories based on labeled data, tasks, and domains for 
source and target 


10.3 Self-Taught Learning 


Self-taught learning, as shown in Fig. 10.4, consists of two distinct steps: (a) learn- 
ing features in an unsupervised manner from the unlabeled source dataset and (b) 
tuning these learned features with a classifier on the target dataset which has labels. 
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Fig. 10.4: Self-taught learning using pre-training and fine-tuning steps. (a) Using 
unlabeled source dataset to learn features. (b) Using the labeled target data to fine- 
tune the features with a classifier 


10.3.1 Techniques 


In this section we will summarize various approaches and then discuss specific al- 
gorithms or techniques that have been successful in NLP and speech. 


10.3.1.1 Unsupervised Pre-training and Supervised Fine-Tuning 


Algorithm 1: Unsupervised feature learning

Data: Training dataset $x_{1(S)}, x_{2(S)}, \ldots, x_{n(S)}$ such that $x_{i(S)} \in \mathbb{R}^d$, layers $= L$
Result: Weight matrix $W_l \in \mathbb{R}^d$ and $b_l \in \mathbb{R}$ for each layer $l$
begin
    appendClassifierLayer($h_{L+1}$)
    for $l = k$ to $L$ do
        $W_l, b_l$ = trainUnsupervised($(x_{1(S)}, x_{2(S)}, \ldots, x_{n(S)})$)
    return $W_l, b_l$ for each layer $l$


The input to Algorithm 1 is the unlabeled source dataset of size $n$; the $(S)$ in
the subscript denotes the source. The first part of learning proceeds in an unsupervised
way from the source, as shown in Algorithm 1. This has many similarities
to feature or dimensionality reduction and manifold learning in traditional machine
learning. This process generally employs linear and non-linear techniques to find a
latent representation of the input that has a smaller dimension than the input. In deep
learning, the trainUnsupervised step in the algorithm above corresponds to one of many
unsupervised techniques that can be used for feature learning, such as PCA or ICA
layers, restricted Boltzmann machines, autoencoders, sparse autoencoders, denoising
autoencoders, contractive autoencoders, and sparse coding techniques, to name a few.
Training can be done per layer or over all layers, based on the algorithm. The function
trainUnsupervised corresponds to the general call to the underlying algorithm.

Autoencoders are the most popular technique among unsupervised learning approaches;
basic encoding and decoding happen between the layers to reconstruct the
input. The number of neurons, or the layer size, plays an important role in
autoencoder learning. When the size is smaller than the input, it is called an
undercomplete representation and can be seen as a compression mechanism that finds a
representation in a lower dimension. When the size is greater than the input, it is called
an overcomplete representation and requires regularization techniques such as sparsity to
enforce learning important features. In many practical applications, autoencoders
are stacked together to create hierarchical or advanced features from the inputs.
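
As an illustration of the undercomplete case, the following is a minimal NumPy sketch of a single-hidden-layer autoencoder trained by plain gradient descent on the mean squared reconstruction error. The tanh encoder, linear decoder, learning rate, and function names (`train_autoencoder`, `encode`) are all assumptions made for the example, not choices taken from the text.

```python
import numpy as np


def train_autoencoder(X, hidden_size, epochs=1000, lr=0.01, seed=0):
    """Fit a tanh encoder and linear decoder to reconstruct X (n x d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, hidden_size))
    b_enc = np.zeros(hidden_size)
    W_dec = rng.normal(scale=0.1, size=(hidden_size, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W_enc + b_enc)      # encode to the smaller layer
        X_hat = H @ W_dec + b_dec           # decode back to input size
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ W_dec.T) * (1.0 - H ** 2)
        W_dec -= lr * (H.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)
    params = {"W_enc": W_enc, "b_enc": b_enc, "W_dec": W_dec, "b_dec": b_dec}
    return params, losses


def encode(X, params):
    """Map inputs to the learned lower-dimensional representation."""
    return np.tanh(X @ params["W_enc"] + params["b_enc"])
```

With `hidden_size` smaller than the input dimension, `encode` returns the compressed representation that would be passed to the next stacked layer or to a classifier. Standardizing the input features beforehand keeps this simple gradient descent loop well behaved.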

Once these features are learned, the next step is to use the target dataset to fine- 
tune them with a classifier layer such as softmax. There are various choices, such 
as freezing the state of the learned layers at some level k > 1 and only using the rest 
of the layers for tuning or using all the layers for tuning. Algorithm 2 shows how 
the fine-tuning process uses the labeled target dataset of size m. 


Algorithm 2: SupervisedFineTuning
Data: Training dataset $(x_{1(T)}, y_1), (x_{2(T)}, y_2), \ldots, (x_{m(T)}, y_m)$ such that $x_{i(T)} \in \mathbb{R}^d$ and
$y_i \in \{+1, -1\}$, trained layers $h_1, h_2, \ldots, h_L$, training layer start $k$
Result: Weight matrix $W_l \in \mathbb{R}^d$ and $b_l \in \mathbb{R}$ for each layer $l$
begin
    appendClassifierLayer($h_{L+1}$)
    for $l = k$ to $L$ do
        $W_l, b_l$ = train($(x_{1(T)}, y_1), (x_{2(T)}, y_2), \ldots, (x_{m(T)}, y_m)$)
    return $W_l, b_l$ for each layer $l$
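
The two-step procedure can be sketched end to end in NumPy. This is only one possible instantiation, under stated assumptions: PCA stands in for the unsupervised feature learner, the learned features are kept frozen, and the appended classifier layer is a logistic regression trained by gradient descent on labels in {+1, -1}. All function names are hypothetical.

```python
import numpy as np


def pretrain_features(X_source, k):
    """Unsupervised step: fit a k-dimensional PCA on unlabeled source data."""
    mean = X_source.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_source - mean, full_matrices=False)
    return mean, Vt[:k].T                      # shapes (d,) and (d, k)


def extract(X, mean, W):
    """Apply the frozen, pre-trained feature layer."""
    return (X - mean) @ W


def finetune_classifier(H, y, epochs=300, lr=0.1):
    """Supervised step: logistic regression on features H, y in {+1, -1}."""
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = np.clip(y * (H @ w + b), -50.0, 50.0)
        # Derivative of log(1 + exp(-y * z)) with respect to the score z.
        coeff = -y / (1.0 + np.exp(margins))
        w -= lr * (H.T @ coeff) / len(y)
        b -= lr * coeff.mean()
    return w, b


def predict(H, w, b):
    """Return +1 or -1 for each row of H."""
    return np.where(H @ w + b >= 0.0, 1, -1)
```

Freezing the pre-trained layer corresponds to starting the tuning at the classifier only; tuning all layers would require backpropagating through the feature extractor as well.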


10.3.2 Theory 


In their seminal work, Erhan et al. give interesting theoretical and empirical insights 
around unsupervised pre-training and fine-tuning [Erh+10]. They use various archi- 
tectures such as feed-forward neural networks, deep belief networks, and stacked 
denoising autoencoders on different datasets to empirically verify the various theo- 
retical conclusions in a step-by-step, controlled manner. 

They show that pre-training not only gives a good starting condition but captures
complex dependencies among the parameters as well. The research also shows that
unsupervised pre-training can be a form of regularization that guides the weights
towards a better basin of attraction of minima. The regularization obtained from
the pre-training process influences the starting point in supervised learning, and,
unlike standard regularization techniques such as $L_1$/$L_2$, its effect does not
disappear with more data. Unsurprisingly, the research concludes that in small
training data settings, unsupervised pre-training has many advantages. It also shows
that, in some cases, the order of training examples impacts the results, but
pre-training reduces the variance even in such cases. The experiments and results that
show unsupervised pre-training to be a general variance-reducing technique, and even
an optimization aid for better training, are insightful.


10.3.3 Applications in NLP 


Using unsupervised techniques to learn word embeddings from a large corpus of data
and employing them in various supervised tasks has been the most basic application in
NLP. Since this was discussed at length in Chap. 5, we will focus more on other
NLP tasks. Dai and Le show that unsupervised feature learning using sequence
autoencoders or language model-based systems, followed by supervised training,
achieves great results in text classification tasks on various datasets, such as
IMDB, DBpedia, and 20 Newsgroups [DL15]. The sequence autoencoder uses an
LSTM encoder-decoder to capture the dependencies in an unsupervised manner.
The weights from the LSTM are used to initialize the LSTM with a softmax classifier
in a supervised setting. The unsupervised autoencoder training shows superior
results across all datasets, and the generality of the technique gives it an edge for all
sequence-to-sequence problems.

Ramachandran et al. show that the LSTM encoder pre-trained for language
modeling can be used very effectively without fine-tuning in sentiment classification
[RLL17]. Dieng et al. show that TopicRNN, an architecture using an RNN for
local syntactic dependencies and topic modeling for global semantic latent
representations, can be a very effective feature extractor [Die+16]. TopicRNN achieves
nearly state-of-the-art results on the sentiment classification task. Turian et al. show that
learning features in an unsupervised manner from multiple embeddings and applying
them to various supervised NLP tasks, such as chunking and NER, can give nearly
state-of-the-art results [TRB10].


10.3.4 Applications in Speech 


Very early on, Dahl et al. showed that unsupervised pre-training gives
a good initialization for the weights, and that labeled fine-tuning of deep belief
networks further improves results in the automatic speech recognition task [Dah+12].
Hinton et al. show that unsupervised layer-by-layer pre-training of RBMs
followed by fine-tuning with labeled examples not only reduces overfitting but also
reduces the time to learn on labeled examples [Hin+12]. Lee et al. show that
unsupervised feature learning on a large dataset with deep convolutional networks can
learn phonemes that help various audio classification tasks [Lee+09].
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10.4 Multitask Learning 


Whether specifically in deep learning or generically in machine learning, the overall
process is to learn a model for a task at hand given the dataset corresponding to
that task. This can be seen as single-task learning. An extension of this is multitask
learning (MTL), where one tries to learn jointly from multiple tasks and their
corresponding datasets [Rud17]. Caruana defines the goal of multitask learning as “MTL
improves generalization by leveraging the domain-specific information contained in
the training signals of related tasks.” Multitask learning can also be referred to as
an inductive transfer process. The inductive bias in MTL is introduced by forcing
the model to prefer hypotheses that explain multiple tasks rather than a single
task. Multitask learning has generally been effective when there is limited labeled
data for each task and there is an overlap in knowledge or learned features
between the tasks.


10.4.1 Techniques 


The two general ways of handling multitask learning in deep learning are through
hard or soft parameter sharing, as shown in Fig. 10.5. Hard parameter sharing is
one of the oldest techniques in NNs: a single model is used, where the hidden layers
share common weights, and task-specific weights are learned at the output
layers [Car93]. The most important benefit of hard parameter sharing is the prevention
of overfitting by enforcing more generalization across tasks. Soft parameter sharing,
on the other hand, has individual models with separate parameters per task, and a
constraint is imposed to make the parameters across tasks more similar. Regularization
techniques are often used in soft parameter sharing for enforcing the constraints.
In the next section, we will go through selected deep learning networks that have
proven useful for multitask learning.
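
The difference between the two schemes can be made concrete with a small NumPy sketch. It assumes a single shared ReLU hidden layer for hard sharing and, for soft sharing, a squared L2 distance between per-task weight matrices as the similarity constraint (one common choice among several). The function names and the `strength` parameter are hypothetical.

```python
import numpy as np


def hard_sharing_forward(x, W_shared, task_heads):
    """Hard sharing: one shared hidden layer, one output layer per task."""
    hidden = np.maximum(0.0, x @ W_shared)      # shared by every task
    return {task: hidden @ W_out for task, W_out in task_heads.items()}


def soft_sharing_penalty(task_weights, strength=0.01):
    """Soft sharing: penalize distance between per-task weight matrices."""
    names = sorted(task_weights)
    penalty = 0.0
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            diff = task_weights[first] - task_weights[second]
            penalty += np.sum(diff ** 2)
    return strength * penalty
```

In hard sharing, gradients from every task update `W_shared`; in soft sharing, each task keeps its own weights and the penalty is added to the sum of the task losses.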


10.4.1.1 Multilinear Relationship Network 


Fig. 10.5: Two generic methods of multitask learning. (a) Hard parameter sharing
in the hidden layers. (b) Soft parameter sharing across the hidden layers for various
tasks

One of the earliest deep learning networks for multitask learning was introduced
by Long and Wang and is known as the multilinear relationship network
(MRN) [LW15]. MRN showed state-of-the-art performance on different tasks in image
recognition. The MRN, as shown in Fig. 10.6, is a modification of the AlexNet
architecture that was discussed in Chap. 6. The first few layers are convolutional,
and a fully connected layer learns the transferable features, while the rest of the
fully connected layers closer to the output learn task-specific features. Suppose there are
$T$ tasks with training data $\{X_t, Y_t\}_{t=1}^{T}$, where $X_t = \{x_1^t, \ldots, x_{N_t}^t\}$ and $Y_t = \{y_1^t, \ldots, y_{N_t}^t\}$,
$N_t$ is the number of training examples and labels of the $t$th task, the feature space is
$D$-dimensional, and the label space has cardinality $C$. The network parameters of the
$t$th task in the $l$th layer are given by $W^{t,l} \in \mathbb{R}^{D_1^l \times D_2^l}$, where $D_1^l$ and $D_2^l$ are the
dimensions of matrix $W^{t,l}$, and the parameter tensor is
$\mathcal{W}^l = [W^{1,l}; \ldots; W^{T,l}] \in \mathbb{R}^{D_1^l \times D_2^l \times T}$. The fully connected layers
(fc6-fc8) learn the mappings given by $h_n^{t,l} = a^l(W^{t,l} h_n^{t,l-1} + b^{t,l})$, where $h_n^{t,l}$ is
the hidden representation for each data instance $x_n^t$, $W^{t,l}$ is the weight, $b^{t,l}$ is the
bias, and $a^l$ is the activation function, such as ReLU. The classifier of the $t$th task is
given by $y = f_t(x)$, and the empirical error is given by:

$$\min \sum_{n=1}^{N_t} J\big(f_t(x_n^t), y_n^t\big) \qquad (10.1)$$

where $J(\cdot)$ is the cross-entropy loss function and $f_t(x_n^t)$ is the conditional probability
that the network assigns for the data point $x_n^t$ to the label $y_n^t$. MRN places tensor normal
priors over the parameter tensors in the fully connected task-specific layers, similar
to Bayesian models, which act as regularization on task-related learning.

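The maximum a posteriori (MAP) estimation of the network parameters
$\mathcal{W} = \{\mathcal{W}^l : l \in \mathcal{L}\}$ for the task-specific layers $\mathcal{L} = \{fc7, fc8\}$ given the training data is:

$$P(\mathcal{W}|X,Y) \propto P(\mathcal{W}) \cdot P(Y|X,\mathcal{W}) \qquad (10.2)$$

$$P(\mathcal{W}|X,Y) = \prod_{l \in \mathcal{L}} P(\mathcal{W}^l) \cdot \prod_{t=1}^{T} \prod_{n=1}^{N_t} P(y_n^t | x_n^t, \mathcal{W}^l) \qquad (10.3)$$

with the assumption that the prior $P(\mathcal{W}^l)$ and the parameter tensors $\mathcal{W}^l$ for each
layer are independent of the other layers.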

The maximum likelihood estimation (MLE) part $P(Y|X,\mathcal{W})$ is modeled to learn
the transferable features in the lower layers, and all the parameters for the layers
(conv1-fc6) are shared. The task-specific layers (fc7, fc8) are not shared to avoid
negative transfer. The prior part $P(\mathcal{W})$ is defined as the tensor normal distribution
and is given as:

$$P(\mathcal{W}) = \mathcal{TN}_{D_1^l \times D_2^l \times T}(O, \Sigma_1^l, \Sigma_2^l, \Sigma_3^l) \qquad (10.4)$$

where $\Sigma_1^l$, $\Sigma_2^l$, and $\Sigma_3^l$ are the mode covariance matrices. In the tensor prior, the
row covariance matrix $\Sigma_1^l \in \mathbb{R}^{D_1^l \times D_1^l}$ learns the relationships between the features,
the column covariance matrix $\Sigma_2^l \in \mathbb{R}^{D_2^l \times D_2^l}$ learns the relationships between classes,
and the covariance matrix $\Sigma_3^l \in \mathbb{R}^{T \times T}$ learns the relationships between tasks in the
$l$th layer parameters $\mathcal{W}^l = [W^{1,l}; \ldots; W^{T,l}]$. The empirical error given in Eq. 10.1 is
integrated with the prior given in Eq. 10.4 into the MAP estimation given in Eq. 10.3,
and after taking the negative logarithm, the problem to optimize is:


$$\min_{f_t|_{t=1}^{T},\, \Sigma_k^l|_{k=1}^{K}} \sum_{t=1}^{T} \sum_{n=1}^{N_t} J\big(f_t(x_n^t), y_n^t\big) + \frac{1}{2} \sum_{l \in \mathcal{L}} \left( \mathrm{vec}(\mathcal{W}^l)^\top (\Sigma_{1:K}^l)^{-1} \mathrm{vec}(\mathcal{W}^l) - \sum_{k=1}^{K} \frac{D^l}{D_k^l} \ln(|\Sigma_k^l|) \right) \qquad (10.5)$$

where $D^l = \prod_{k=1}^{K} D_k^l$, $K = 3$ is the number of modes in the parameter tensor $\mathcal{W}$
(or $K = 4$ for the convolutional layers), and $\Sigma_{1:3}^l = \Sigma_1^l \otimes \Sigma_2^l \otimes \Sigma_3^l$ is the Kronecker
product of the feature, class, and task covariances. The optimization problem given in
Eq. 10.5 is jointly non-convex with respect to the parameter tensors and the covariance
matrices, and hence one set of variables is optimized while keeping the rest
fixed. The experiments with MRN on different computer vision multitask learning
datasets show that it can achieve state-of-the-art performance.
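
To make the regularization term of Eq. 10.5 concrete, the NumPy sketch below evaluates it for a single layer with $K = 3$ modes. It is an illustration under assumptions: the tensor is flattened in row-major order so that it lines up with the Kronecker ordering feature, class, task; the full Kronecker product is formed explicitly, which is only feasible for tiny layers; and the log-determinant term carries the sign shown in Eq. 10.5. The function name is hypothetical.

```python
import numpy as np


def tensor_normal_penalty(W, cov_feat, cov_class, cov_task):
    """Regularizer of Eq. 10.5 for one layer; W has shape (D1, D2, T)."""
    D1, D2, T = W.shape
    D = D1 * D2 * T
    # Kronecker product of the feature, class, and task covariances.
    sigma = np.kron(np.kron(cov_feat, cov_class), cov_task)
    # Row-major flattening matches the Kronecker ordering used above.
    w = W.reshape(-1)
    quad = w @ np.linalg.solve(sigma, w)
    # Sum over modes of (D / D_k) * ln|Sigma_k|.
    logdet = sum(
        (D / Dk) * np.linalg.slogdet(cov)[1]
        for Dk, cov in ((D1, cov_feat), (D2, cov_class), (T, cov_task))
    )
    return 0.5 * (quad - logdet)
```

With identity covariances the term reduces to half the squared norm of the weights, i.e., ordinary weight decay; non-identity task covariances are what couple the task-specific layers.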


10.4.1.2 Fully Adaptive Feature Sharing Network 


Lu et al. take the approach of task-specific learning as a search, starting from a 
thin network and then branching out in a principled way to form wide networks 
during the training process [Lu+16]. The approach also introduces a new technique, 
simultaneous orthogonal matching pursuit (SOMP), for initializing a thin network 
from a wider pre-trained network for faster convergence and improved accuracy. 
The methodology has three phases: 


1. Thin Model Initialization: Since the thin network has different dimensions
than the pre-trained network, the weights cannot be copied directly. Instead, the
method uses SOMP to learn how to select a subset of rows $d'$ from the original
rows $d$ for every layer $l$. This is a non-convex optimization problem,
and hence a greedy approach, described in detail in the paper, is used for
solving it.

2. Adaptive Model Widening: After the initialization process, each layer, starting
from the top layer, goes through a widening process. Widening creates
sub-branches in the network, so that each branch performs a
subset of the tasks performed by the network. A point where the network branches is
called a junction, and it is widened by adding more output layers. Figure 10.7 shows
the iterative widening process. If there are $T$ tasks, the final output layer $l$ of
the thin network has a junction with $T$ branches, and each can be considered
a sub-branch. The iterative process starts by grouping the tasks into $t$ branches
such that $t < T$ at the layer $l$ and then recursively moves in a top-down
manner to the next layer $l-1$ and so on. The grouping of the tasks is based
on a concept of “affinity,” which is the probability of concurrently
observing simple or difficult examples from the training data for a pair of
tasks.

3. Final Model Training: The last step is to train the final model after the thin
model initialization and the recursive widening process.

Fig. 10.6: A multilinear relationship network, in which the first few layers learn the
shared features, and the final layers learn task-specific features with tensor normal
priors


10.4.1.3 Cross-Stitch Networks 


As shown in Fig. 10.8, these deep networks are modifications of AlexNet,
where shared and task-specific representations are learned using linear
combinations [Mis+16]. For each task, there is a deep network such as AlexNet, and
cross-stitch units connect the pooling-layer outputs of the networks to the inputs of the
following convolutional or fully connected layers. The cross-stitch units are linear
combinations of the task outputs that learn the shared representation. They were shown
to be very effective in a data-starved multitask setting.

Fig. 10.7: An iterative process showing how the network is widened at a layer on a
specific iteration to group the tasks

Fig. 10.8: A cross-stitch network trying to learn a latent representation that is useful
for two tasks

Consider two tasks A and B and multitask learning on the same input data.
A cross-stitch unit, shown in Fig. 10.9, plays the role of combining two networks
into a multitask network, such that the tasks control the amount of sharing. Given
two activation outputs $x_A, x_B$ from a layer $l$, a linear combination is learned using
parameters $\alpha$ to produce outputs $\tilde{x}_A, \tilde{x}_B$, which flow into the next layers. For
a location $(i, j)$, it is given by:

$$\begin{bmatrix} \tilde{x}_A^{ij} \\ \tilde{x}_B^{ij} \end{bmatrix} = \begin{bmatrix} \alpha_{AA} & \alpha_{AB} \\ \alpha_{BA} & \alpha_{BB} \end{bmatrix} \begin{bmatrix} x_A^{ij} \\ x_B^{ij} \end{bmatrix} \qquad (10.6)$$
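
The linear combination performed by a cross-stitch unit can be sketched in a few lines of NumPy. The example assumes the same 2 x 2 matrix of $\alpha$ values is applied at every location of the two activation maps; in a real network the $\alpha$ values would be learned by backpropagation along with the other weights. The function name is hypothetical.

```python
import numpy as np


def cross_stitch(x_a, x_b, alpha):
    """Mix two same-shaped activation maps with a 2 x 2 alpha matrix.

    alpha is laid out as [[a_AA, a_AB], [a_BA, a_BB]].
    """
    alpha = np.asarray(alpha, dtype=float)
    out_a = alpha[0, 0] * x_a + alpha[0, 1] * x_b
    out_b = alpha[1, 0] * x_a + alpha[1, 1] * x_b
    return out_a, out_b
```

An identity `alpha` keeps the two networks fully task-specific, while larger off-diagonal values increase the amount of sharing between the tasks.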
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Fig. 10.9: Cross-stitch unit 


10.4.1.4 A Joint Many-Task Network 


NLP tasks can generally be viewed as a hierarchical pipeline, where the output of one
task is useful as input to the next. Søgaard and Goldberg show that
supervised multitask learning at different layers of a bidirectional RNN architecture,
arranged so that the low-level tasks feed into high-level tasks, can achieve great
results [SG16]. Hashimoto et al. extend the idea by creating a single end-to-end deep
learning network, whose depth grows to capture the linguistic hierarchy
from syntactic to semantic representations, as shown in Fig. 10.10 [Has+16]. It
has been shown that a single end-to-end network with this architecture can achieve
state-of-the-art results in different tasks such as chunking, dependency parsing,
semantic relatedness, and textual entailment.

Fig. 10.10: A joint multitask network

A given sentence $s$ of length $L$ has words $w_t$. For each word, there is a skip-gram
word embedding and a character embedding. The word representation $x_t$ is formed by
concatenating the word embedding and the character n-gram embeddings, which are learned
using skip-gram with negative sampling. The character n-grams provide
morphological features for the tasks. The first task is POS tagging, which is
performed using a bidirectional LSTM with embedded inputs and a softmax for
classifying the tags. The POS tags have learnable label embeddings, which are used in the
next chunking layer. The label embedding for POS tagging (and many other tasks) is
given by:

$$y_t^{POS} = \sum_{j=1}^{C} p(y_t^{POS} = j \mid h_t^{POS})\, \ell(j) \qquad (10.7)$$

where $C$ is the number of POS tags, $p(y_t^{POS} = j \mid h_t^{POS})$ is the probability that the
$j$th POS tag is assigned to the token $w_t$, and $\ell(j)$ is the label embedding for the $j$th
POS tag. The second task is chunking, which uses a bidirectional LSTM and takes as input
the hidden state from the POS bidirectional LSTM, the hidden state of its own LSTM, the
embedded token, and the label embedding from POS tagging. The POS tagging layer and the
chunking layer with their hidden states are useful in generating low-level features that
are useful for many tasks, as known from traditional feature engineering in NLP. The
third task is dependency parsing, again using a bidirectional LSTM, with inputs
from the hidden states of the chunking layer, the previous hidden state from dependency
parsing, the embedded token, and the label embeddings of the POS layer and the chunking
layer. The next two tasks are semantic, in contrast to the syntactic tasks in the
previous layers. The semantic relatedness task
is to compare two sentences and give a real-valued output as a measure of their
relatedness. The sentence-level representation is obtained via max pooling over the
hidden states of the LSTM and is given by:

$$h_s^{REL} = \max(h_1^{REL}, h_2^{REL}, \ldots, h_L^{REL}) \qquad (10.8)$$

The relatedness features for two sentences $(s, s')$ are given by:

$$d_1(s, s') = \left[\, |h_s^{REL} - h_{s'}^{REL}| ;\ h_s^{REL} \odot h_{s'}^{REL} \,\right] \qquad (10.9)$$

The values of $d_1(s, s')$ are given to a softmax layer with a maxout hidden layer to
give a relatedness score.
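
The three building blocks described above (the probability-weighted label embedding, the max-pooled sentence representation, and the relatedness feature vector) are simple array operations. The NumPy sketch below shows them in isolation; the function names are hypothetical, and the LSTM layers that would produce the tag probabilities and hidden states are assumed to exist elsewhere.

```python
import numpy as np


def soft_label_embedding(tag_probs, label_embeddings):
    """Weighted label embedding: tag_probs (C,), label_embeddings (C, dim)."""
    return tag_probs @ label_embeddings


def sentence_representation(hidden_states):
    """Element-wise max pooling over time: (L, dim) -> (dim,)."""
    return hidden_states.max(axis=0)


def relatedness_features(h_s, h_s_prime):
    """Concatenate |h_s - h_s'| with the element-wise product."""
    return np.concatenate([np.abs(h_s - h_s_prime), h_s * h_s_prime])
```

Using the full probability distribution over tags, rather than only the most likely tag, lets the next layer see how uncertain the lower-level prediction is.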

The last task is textual entailment, which again takes two sentences and
assigns one of the categories entailment, contradiction, or neutral. The label
embeddings from the relatedness task, along with a distance measure similar to Eq. 10.9
derived from the LSTM layer, feed into a softmax classifier for classification.

When a network is trained sequentially on one task and then on another
task, it generally “forgets” the first, i.e., performs badly on it. This phenomenon
is called catastrophic interference or catastrophic forgetting. The training for
each layer is similar, with a loss function that takes into account (a) the
classification loss for the layer, computed from predictions and labels, (b) the L2-norm
of its weight vectors, and (c) a regularization term for the parameters of the previous
tasks if they are inputs. According to the authors, joint learning makes the framework
robust to catastrophic interference. As an example, consider the chunking layer, whose
inputs come from POS tagging with weights and bias $\theta_{POS}$, with $\theta'_{POS}$ denoting the
POS parameters after training the POS layer in the current epoch. With chunking layer
weights $W_{CHK}$ and probability $p(y_t^{CHK} = \alpha \mid h_t^{CHK})$ of assigning the correct label
$\alpha$ to $w_t$ in the sentence, the loss is:

$$J_2(\theta_{CHK}) = -\sum_s \sum_t \log p(y_t^{CHK} = \alpha \mid h_t^{CHK}) + \lambda \|W_{CHK}\|^2 + \delta \|\theta_{POS} - \theta'_{POS}\|^2 \qquad (10.10)$$
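
The three parts of the chunking loss can be computed directly once the model's probabilities for the gold labels are available. The NumPy sketch below assumes squared L2 norms for both regularizers and takes the gold-label probabilities as a flat array over all tokens; the function and argument names are hypothetical.

```python
import numpy as np


def chunking_loss(gold_label_probs, W_chk, theta_pos, theta_pos_prime,
                  lam, delta):
    """Negative log-likelihood plus weight decay and successive penalty."""
    # (a) classification loss over all tokens of all sentences
    nll = -np.sum(np.log(gold_label_probs))
    # (b) L2 penalty on the chunking layer weights
    weight_decay = lam * np.sum(W_chk ** 2)
    # (c) keep the POS parameters close to their values after POS training
    successive_reg = delta * np.sum((theta_pos - theta_pos_prime) ** 2)
    return nll + weight_decay + successive_reg
```

The third term is what discourages training on chunking from overwriting what the POS layer has already learned.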


10.4.1.5 Sluice Networks 


Ruder et al. recently proposed a general deep learning architecture, known as sluice
networks, that combines concepts from several previous lines of research, such as
hard parameter sharing, cross-stitch networks, block-sparse regularization, and NLP
linguistic hierarchical multitask learning [Rud17]. The sluice network for the main
task A and an auxiliary task B consists of a shared input layer, three hidden layers
per task, and two task-specific output layers, as shown in Fig. 10.11. Each hidden
layer for a task is an RNN divided into two subspaces; for example, task A and
layer 1 has $G_{A,1,1}$ and $G_{A,1,2}$, which allows the layers to learn task-specific and shared
representations effectively. The output of the hidden layers flows through $\alpha$ parameters
to the next layer, which carry out linear combinations of the inputs to weigh the
importance of sharing and task-specific learning. By giving each subspace its own
weights and controlling how the subspaces share, sluice networks adaptively learn
only what is useful in multitask settings. The final recurrent
hidden layers pass the information to $\beta$ parameters, which combine everything
the layers have learned. Ruder et al. empirically show that main tasks, such
as NER and SRL, can benefit from auxiliary tasks such as POS tagging, reducing
errors by a significant margin.

Fig. 10.11: Sluice networks for multitask learning across loosely connected tasks

Ruder et al. cast the entire learning as a matrix regularization problem.
If there are $M$ tasks that are loosely related with $M$ non-overlapping datasets
$D_1, D_2, \ldots, D_M$, $K$ layers given by $L_1, L_2, \ldots, L_K$, and models $\theta_1, \theta_2, \ldots, \theta_M$, each
with $D$ parameters, and an explicit inductive bias $\Omega$ as a penalty, then the loss function
to minimize is given by:

$$\lambda_1 \mathcal{L}_1(f(x; \theta_1), y_1) + \cdots + \lambda_M \mathcal{L}_M(f(x; \theta_M), y_M) + \Omega \qquad (10.11)$$

The loss functions $\mathcal{L}_i$ are cross-entropy loss functions, and the weights $\lambda_i$
determine the importance of task $i$ during training. If $G_{m,k,1}$ and $G_{m,k,2}$ are the two
subspaces for each layer, the inductive bias is given by the orthogonality constraint:

$$\Omega = \sum_{m=1}^{M} \sum_{k=1}^{K} \left\| G_{m,k,1}^\top G_{m,k,2} \right\|_F^2 \qquad (10.12)$$
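
The weighted sum of task losses and the orthogonality penalty can be written as two short NumPy functions. This is a sketch under assumptions: each subspace is represented by a plain matrix, the penalty uses the squared Frobenius norm, and the per-task losses are supplied as already-computed numbers. The function names are hypothetical.

```python
import numpy as np


def orthogonality_penalty(subspace_pairs):
    """Sum of squared Frobenius norms of G1^T G2 over all (task, layer) pairs."""
    return sum(
        np.linalg.norm(G1.T @ G2, ord="fro") ** 2
        for G1, G2 in subspace_pairs
    )


def sluice_objective(task_losses, task_weights, penalty):
    """Weighted sum of per-task losses plus the inductive bias term."""
    weighted = sum(w * loss for w, loss in zip(task_weights, task_losses))
    return weighted + penalty
```

The penalty is zero when the two subspaces of every layer are orthogonal, which pushes them to capture different (shared versus task-specific) information.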





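The matrix regularization is carried out by updating the $\alpha$ parameters, similarly
to the cross-stitch units of Misra et al. [Mis+16]. For two tasks (A, B) and
layer $k$, the extension of the cross-stitch linear combination looks like:


$$\begin{bmatrix} \tilde{h}_{A_1,k} \\ \vdots \\ \tilde{h}_{B_2,k} \end{bmatrix} = \begin{bmatrix} \alpha_{A_1A_1} & \cdots & \alpha_{B_2A_1} \\ \vdots & \ddots & \vdots \\ \alpha_{A_1B_2} & \cdots & \alpha_{B_2B_2} \end{bmatrix} \begin{bmatrix} h_{A_1,k}^{\top}, \ldots, h_{B_2,k}^{\top} \end{bmatrix}^{\top} \tag{10.13}$$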


where $h_{A_1,k}$ is the output of the first subspace for task A in layer $k$, and
$\tilde{h}_{A_1,k}$ is the linear combination for that first subspace of task A. The input
to layer $k+1$ is the concatenation of the two, given as
$h_{A,k} = [\tilde{h}_{A_1,k}, \tilde{h}_{A_2,k}]$. The hierarchical relationship between the
low-level tasks and the high-level tasks is learned using the skip-connections
between the layers with the $\beta$ parameters. This acts as a mixture model and can
be written as:


$$\tilde{h}_A^{\top} = \begin{bmatrix} \beta_{A,1} \\ \vdots \\ \beta_{A,K} \end{bmatrix}^{\top} \begin{bmatrix} h_{A,1}^{\top}, \ldots, h_{A,K}^{\top} \end{bmatrix}^{\top} \tag{10.14}$$


where $h_{A,k}$ is the output of layer $k$ for task A, and $\tilde{h}_A^{\top}$ is the linear
combination of all layer outputs that gets fed to a softmax classifier.
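The two linear combinations described above can be illustrated with a small sketch. It assumes four subspace outputs per layer (A1, A2, B1, B2), an $\alpha$ matrix whose rows hold the mixing weights for each target subspace, and one $\beta$ weight per layer; all names and numbers are hypothetical and only show the arithmetic, not the trained network.

```python
def mix_subspaces(alpha, outputs):
    """Alpha mixing: every mixed output is a weighted sum of all subspace outputs."""
    dim = len(outputs[0])
    return [
        [sum(weight * out[j] for weight, out in zip(row, outputs)) for j in range(dim)]
        for row in alpha
    ]


def mix_layers(beta, layer_outputs):
    """Beta mixing: weighted sum of the outputs of all layers for one task."""
    dim = len(layer_outputs[0])
    return [
        sum(weight * out[j] for weight, out in zip(beta, layer_outputs))
        for j in range(dim)
    ]


# Outputs of subspaces A1, A2, B1, B2 at one layer (toy values).
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 4.0]]

# Assumed alpha: A1 borrows half of its input from B1, the rest are unchanged.
alpha = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
mixed = mix_subspaces(alpha, outputs)

# Input to the next layer of task A: concatenation of its two mixed subspaces.
next_input_task_a = mixed[0] + mixed[1]

# Beta mixing over two layers of task A (toy values).
final_task_a = mix_layers([0.25, 0.75], [[4.0, 0.0], [0.0, 4.0]])
```

With an identity $\alpha$ matrix the tasks do not share anything; off-diagonal weights are what let one task read from another task's subspace.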


10.4.2 Theory 


Caruana in his early research on MTL and then Ruder in his work have summarized
different reasons why and when multitask learning works and is effective [Car97,
Rud17].


1. Implicit Data Augmentation: When the constraint is limited data per task,
jointly learning different tasks that are similar increases the total training
data size. As learning theory suggests, more training data generally gives
better model quality.

2. Attention Focusing: When the constraint is noisy data per task, jointly
learning different tasks focuses the model on the relevant features that are
useful across tasks. This joint learning, in general, acts as an implicit
feature selection mechanism.

3. Eavesdropping: When the training data is limited, the features that may be
needed for a particular task may not be in the data. By having multiple datasets
for multiple tasks, features can eavesdrop, i.e., the features learned for a separate
task can be used for the task in question and help in the generalization of that
specific task.

4. Representation Bias: Multitask learning enforces a representation that
generalizes across the tasks and thus forces better generalization.

5. Regularization: Multitask learning is also considered a regularization
technique through inductive bias, which theoretically and empirically is known to
improve model quality.


10.4.3 Applications in NLP 


In his work, Rei shows that using language modeling as an auxiliary task along with 
sequence labeling tasks such as POS tagging, chunking, and named entity detection 
for the main task can improve the results significantly over the benchmarks [Rei17]. 
Fang and Cohn illustrate the advantage of cross-lingual multitask joint learning for 
POS tagging in a low-resource language [FC17]. Yang et al. show that a deep hi- 
erarchical neural network with cross-lingual multitask learning can achieve state- 
of-the-art results in various sequence tagging tasks, such as NER, POS tagging, 
and chunking [YSC16]. Duong et al. use cross-lingual multitask learning to achieve
high accuracy in a low-resource language for dependency parsing [Duo+15].
Collobert and Weston show that multitask learning using CNNs across various tasks
can achieve high accuracy [CW08].

Multitask learning has been most successful in machine translation, where it is
employed at the encoder stage, the decoder stage, or both. Dong et al.
successfully translate from a single source language to multiple target languages
using MTL at the encoder stage in a sequence-to-sequence network [Don+15]. Zoph
and Knight employ multi-source learning as MTL, using French and German sources
to translate effectively to English using MTL at the decoder stage [ZK16].
Johnson et al. show that jointly learning the encoders and decoders enables a
single model to handle multiple source and target languages in a unified
way [Joh+16]. Luong et al. perform a more comprehensive study of
sequence-to-sequence and multitask learning at various stages of
encoding-decoding on several NLP tasks, including translation, to show the
benefits [Luo+15]. Niehues and Cho, in their research on German-English
translation, explore how tasks such as POS tagging and NER can help machine
translation, as well as improve results in these tasks [NC17].

Choi et al. use multitask learning to first learn sentence selection in
comprehension and then use it in a question-answering model to get superior
results [Cho+17]. Another exciting work uses a large corpus of data to learn to
rank the passages that are likely to contain answers and then jointly trains on
these passages with QA models to give state-of-the-art results in open QA
tasks [Wan+18].

Jiang shows how multitask learning, when applied together with weakly supervised
learning for extracting different relation or role types using a joint model, can
improve results [Jia09]. Liu et al. show that joint multitask learning using a deep
neural network on low-resource datasets can improve results in query classification
and web search ranking [Liu+15]. Katiyar and Cardie show how joint extraction
of relations and mentions using attention-based recurrent networks improves on
traditional deep networks [KC17]. Yang and Mitchell highlight how a single model
that jointly learns the two tasks of semantic role labeling and relation
prediction can improve over the state of the art [YM17].

Isonuma et al. show how summarization using a small number of summaries 
and document classification done together give comparable results to the state of 
the art [Iso+17]. In a specific domain such as legal, Luo et al. show how classifi- 
cation with relevant article extraction, when learned jointly, can give improved re- 
sults [Luo+17]. Balikas et al. show how separate sentiment analysis tasks of learn- 
ing ternary and fine-grained classification can be improved using joint multitask 
learning [BMA17]. Augenstein and Søgaard showcase improvements in keyphrase
boundary classification when learning auxiliary tasks, such as semantic super-sense
tagging and identification of multi-word expressions [AS17].


10.4.4 Applications in Speech Recognition


Watanabe et al. highlight how multiple tasks associated with speech recognition can
be performed in a hybrid end-to-end deep learning framework [Wat+17]. This
architecture combines two main approaches, CTC loss and attention-based
sequence-to-sequence, to give results that are comparable with previous HMM-deep
learning based methods. Watanabe et al. also show how multiple tasks, such as
automatic speech recognition (ASR) and language identification/classification across
ten languages, can be performed at the same time using end-to-end deep learning
with multitask learning [WHH17]. Watanabe et al. further show how multitask learning
on ASR and speaker identification can improve the total performance significantly
compared to separately trained models [Wat+18].


10.5 Case Study 


In this case study, we explore how multitask learning can be applied to some com- 
mon NLP tasks such as POS tagging, chunking, and named entity recognition. The 
overall performance depends on many choices such as sequence-to-sequence archi- 
tecture, embeddings, and sharing techniques. 

We will try to answer the following questions. Can low-level tasks such as POS
tagging benefit high-level tasks such as chunking? What is the impact of joint
learning with closely related tasks and with loosely related tasks? Do
connectivity and sharing affect learning? Is there negative transfer, and how does
it affect learning? Do the neural architecture and embedding choices affect
multitask learning? We will use the CoNLL-2003 English dataset, which has
token-level annotations for each of the tasks in our experiments. The CoNLL-2003
dataset already has the standard train, validation, and test splits. We will use
accuracy on the test set as our performance metric for the case study.


• Exploratory data analysis
• Multitask learning experiments and analysis
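The test-set accuracy used as the performance metric in this case study can be sketched as follows. This is a minimal illustration that assumes accuracy is computed per token over all sentences; the function name `token_accuracy` and the toy tag sequences are hypothetical and are not taken from the experiment code.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences must have equal length")
        for gold_tag, predicted_tag in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return 100.0 * correct / total if total else 0.0


# Toy POS tags for two sentences; one of the five tokens is mislabeled.
gold = [["NNP", "NN", "VBZ"], ["IN", "NNP"]]
predicted = [["NNP", "NN", "NN"], ["IN", "NNP"]]
accuracy = token_accuracy(gold, predicted)
```

The same function applies unchanged to chunk and NER tags, since all three tasks are annotated at the token level.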


10.5.1 Software Tools and Libraries 


We will describe the main open source tools and libraries we have used below for 
our case study: 


• PyTorch: We use http://github.com/pytorch/pytorch as our deep learning toolkit
in this case study.

• GloVe: We use https://nlp.stanford.edu/projects/glove/ for our pre-trained
embeddings in the experiments.

• We use https://github.com/SeanNaren/nlp_multi_task_learning_pytorch/ for
multitask learning experiments.


10.5.2 Exploratory Data Analysis


The raw data for training, validation, and testing have a columnar format with 
annotations for each token as given in Table 10.1. 


Table 10.1: Raw data format 


Tokens     POS   CHUNK   NER
U.N.       NNP   I-NP    I-ORG
official   NN    I-NP    O
Ekeus      NNP   I-NP    I-PER
heads      VBZ   I-VP    O
for        IN    I-PP    O
Baghdad    NNP   I-NP    I-LOC
.          .     O       O


Basic analysis for total articles, sentences, and tokens for each dataset is given
in Table 10.2. The tags follow the “inside-outside-beginning” (IOB) scheme for
chunking and NER.
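Reading this columnar format is straightforward. The sketch below assumes whitespace-separated columns in the order token, POS, chunk, NER, with blank lines between sentences and `-DOCSTART-` lines between articles, as in the distributed CoNLL-2003 files; the function name `read_conll` is hypothetical and the sample rows are of the kind shown in Table 10.1.

```python
from collections import Counter


def read_conll(lines):
    """Group (token, pos, chunk, ner) tuples into sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark a new article.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, ner = line.split()
        current.append((token, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- O O

U.N. NNP I-NP I-ORG
official NN I-NP O
Ekeus NNP I-NP I-PER
heads VBZ I-VP O
for IN I-PP O
Baghdad NNP I-NP I-LOC
. . O O
""".splitlines()

sentences = read_conll(sample)
token_count = sum(len(sentence) for sentence in sentences)
ner_counts = Counter(ner for sentence in sentences for (_, _, _, ner) in sentence)
```

Counting sentences, tokens, and NER tags in this way over each split gives the kind of statistics summarized in the tables that follow.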

NER categories and number of tokens for each are given in Table 10.3. 


Table 10.2: Data analysis of CoNLL-2003


Dataset      Articles   Sentences   Tokens
Training     946        14,987      203,621
Validation   216        3466        51,362
Test         231        3684        46,435


Table 10.3: NER tags analysis of CoNLL-2003


Dataset      LOC    MISC   ORG    PER
Training     7140   3438   6321   6600
Validation   1837   922    1341   1842
Test         1668   702    1661   1617


10.5.3 Multitask Learning Experiments and Analysis


We base our model on Søgaard and Goldberg’s research, using bidirectional RNNs
for the encoder and decoder networks in “joint learning” mode. We explore “joint
learning” in two different configurations: (a) shared layers between all the tasks
that are connected to three different softmax layers (POS, chunk, and NER) and
(b) a separate RNN layer for each task, where the hidden layer of the lower layer
flows into the next higher layer as shown in Fig. 10.12.


Fig. 10.12: Bidirectional LSTM configured for multitask learning with cascading
layered architecture


We highlight the code below for the JointModel class where all the configu- 
rations of (a) individual learning, (b) joint with shared layers, and (c) joint with 
cascading are defined. 


    # forward pass of JointModel (method excerpt; the layers are created in __init__)
    def forward(self, input, *hidden):
        if self.train_mode == 'Joint':
            # when the number of layers is the same, the hidden layers are shared
            # and connected to different outputs
            if self.nlayers1 == self.nlayers2 == self.nlayers3:
                logits, shared_hidden = self.rnn(input, hidden[0])
                outputs_pos = self.linear1(logits)
                outputs_chunk = self.linear2(logits)
                outputs_ner = self.linear3(logits)
                return outputs_pos, outputs_chunk, outputs_ner, shared_hidden
            # cascading architecture where low-level tasks flow into high-level tasks
            else:
                # POS tagging task
                logits_pos, hidden_pos = self.rnn1(input, hidden[0])
                self.rnn2.flatten_parameters()
                # chunking using POS
                logits_chunk, hidden_chunk = self.rnn2(logits_pos, hidden[1])
                self.rnn3.flatten_parameters()
                # NER using chunk
                logits_ner, hidden_ner = self.rnn3(logits_chunk, hidden[2])
                outputs_pos = self.linear1(logits_pos)
                outputs_chunk = self.linear2(logits_chunk)
                outputs_ner = self.linear3(logits_ner)
                return (outputs_pos, outputs_chunk, outputs_ner,
                        hidden_pos, hidden_chunk, hidden_ner)
        else:
            # individual task learning
            logits, hidden = self.rnn(input, hidden[0])
            outputs = self.linear(logits)
            return outputs, hidden


Since we have different tasks (POS, chunking, and NER), input layer choices
(pre-trained embeddings or embeddings from the data), neural architecture choices
(LSTM or bidirectional LSTM), and MTL techniques (joint shared and joint
separate), we perform the following experiments to gain insight step by step:


1. LSTM + POS + Chunk: We use LSTM in our encoder-decoder, no pre-trained
embeddings, and use different sharing techniques to see the impact on two tasks,
POS tagging and chunking.

2. LSTM + POS + NER: We use LSTM in our simple encoder-decoder, no pre-trained
embeddings, and use different sharing techniques to see the impact on
two tasks, POS tagging and NER.

3. LSTM + POS + Chunk + NER: We use LSTM in our simple encoder-decoder,
no pre-trained embeddings, and use different sharing techniques to see the
impact on all three tasks, POS tagging, chunking, and NER.

4. Bidirectional LSTM + POS + Chunk: We use bidirectional LSTM in our
encoder-decoder, no pre-trained embeddings, and use different sharing
techniques to see the impact on two tasks, POS tagging and chunking. The impact
of the neural architecture on the learning will be evident from this experiment.

5. LSTM + GloVe + POS + Chunk: We use LSTM in our encoder-decoder, pre-trained
GloVe embeddings, and use different sharing techniques to see the
impact on two tasks, POS tagging and chunking. The impact of pre-trained
embeddings on the learning will be evident from this experiment.

6. Bidirectional LSTM + GloVe + POS + Chunk: We use bidirectional LSTM
in our encoder-decoder, pre-trained GloVe embeddings, and use different
sharing techniques to see the impact on two tasks, POS tagging and chunking. This
experiment gives us insight into how the combination of architecture and
embeddings impacts the learning for the two tasks.

7. Bidirectional LSTM + GloVe + POS + NER: We use bidirectional LSTM in
our encoder-decoder, pre-trained GloVe embeddings, and use different sharing
techniques to see the impact on two tasks, POS tagging and NER. This
experiment gives us insight into how the combination of architecture and embeddings
impacts the learning for the two tasks.

8. Bidirectional LSTM + GloVe + POS + Chunk + NER: We use bidirectional
LSTM in our encoder-decoder, pre-trained GloVe embeddings, and use
different sharing techniques to see the impact on all three tasks, POS tagging,
chunking, and NER. This experiment gives us insight into how the combination
of architecture and embeddings impacts the learning when there are multiple
tasks.


We run all the experiments with 300-dimensional input embeddings (with or without
pre-training), 128 hidden units, a batch size of 128, 300 epochs, the Adam
optimizer, and cross-entropy loss.

In the tables below we give the individual experiment results, color-coding the
results that show improvement in green and those that deteriorate in red.


Table 10.4: Expt 1: LSTM + POS + Chunk


Models               POS Acc %   Chunk Acc %
POS single task      86.33       -
Chunk single task    -           84.69
MTL joint shared     83.91       85.23
MTL joint separate   86.88       85.78
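Whether a multitask configuration improves or deteriorates is judged against the single-task baseline for the same task. The short sketch below computes that difference from the Experiment 1 accuracies reported in Table 10.4; the function name and the dictionary layout are hypothetical choices made only for illustration.

```python
def compare_to_baseline(single_task, mtl_results):
    """Return accuracy change (in points) of each MTL setup over the single-task run."""
    report = {}
    for config, scores in mtl_results.items():
        report[config] = {
            task: round(accuracy - single_task[task], 2)
            for task, accuracy in scores.items()
        }
    return report


# Accuracies from Table 10.4 (LSTM, no pre-trained embeddings).
single_task = {"POS": 86.33, "Chunk": 84.69}
mtl_results = {
    "joint shared": {"POS": 83.91, "Chunk": 85.23},
    "joint separate": {"POS": 86.88, "Chunk": 85.78},
}

report = compare_to_baseline(single_task, mtl_results)
improved = {
    config: [task for task, delta in deltas.items() if delta > 0]
    for config, deltas in report.items()
}
```

A positive difference corresponds to an improvement and a negative one to a deterioration; here the separate-layer configuration improves both tasks, while the shared configuration improves only chunking.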





Table 10.5: Expt 2: LSTM + POS + NER


Models               POS Acc %   NER Acc %
POS single task      86.33       -
NER single task      -           84.92
MTL joint shared     85.62       8.28
MTL joint separate   86.72       9.745


Table 10.6: Expt 3: LSTM + POS + Chunk + NER


Models               POS Acc %   Chunk Acc %   NER Acc %


742 ee 
5 
- 0.08 
591 
71 





Some interesting observations from the experiments are: 


Table 10.7: Expt 4: Bidirectional LSTM + POS + Chunk 


POSsingletask _—*(8656 
(Chunk single tsk _[-__——*(8688 
$8.20 
e711 





Table 10.8: Expt 5: LSTM + GloVe + POS + Chunk 


POS single task 90.55 f= 
(Chunk single tsk [= [5805 





Table 10.9: Expt 6: Bidirectional LSTM + GloVe + POS + Chunk 


(Chunk single task ———=iROD 





Table 10.10: Expt 7: Bidirectional LSTM + GloVe + POS + NER 


Models               POS Acc %   NER Acc %
POS single task      92.42       -
NER single task      -           95.08
MTL joint shared     92.89       95.70
MTL joint separate   92.19


Table 10.11: Expt 8: Bidirectional LSTM + GloVe + POS + Chunk + NER


Models               POS Acc %   Chunk Acc %   NER Acc %
POS single task      92.42       -             -
Chunk single task    -           88.52         -
NER single task      -           -             95.08
MTL joint shared     92.89       89.53         94.92
MTL joint separate   91.95       90.00         95.31





• Tables 10.4 and 10.5 show that joint multitask learning with separate
LSTM layers, as compared to a shared layer, improves the performance for
both combinations, i.e., POS tagging with chunking and POS tagging with
NER.

• Table 10.6 shows that when all three tasks are combined, the results
deteriorate with joint MTL using shared as well as separate layers, except
for chunking. These results are in contrast with Tables 10.4 and 10.5 and
show that when there is a mix of tasks that are not all strongly related,
“negative transfer” comes into the picture.

• The experiment results in Table 10.7 use bidirectional LSTM and show
performance similar to the LSTM models in Table 10.4, indicating that
adding architectural complexity by itself does not change the multitask
behavior, at least in this case.

• Introducing pre-trained embeddings using GloVe vectors gives a large
increase of around 4% in the performance of the single tasks for both POS
tagging and chunking, as shown in Table 10.8. The marginal improvements
from MTL are similar to those without GloVe.

• Experiment 6, given in Table 10.9, shows that when both bidirectional
LSTM and pre-trained GloVe vectors are used, not only do the individual
tasks improve, but the behavior of multitask learning also differs from
that of the basic first experiment in Table 10.4. The shared and separate
layers both show worse performance than the single tasks here. It appears
that the better the individual task performance is, the smaller the impact
of multitask learning.

• Experiment 7, where we combine bidirectional LSTM and pre-trained
GloVe for POS tagging and NER (Table 10.10), shows very different
results from experiment 2 (Table 10.5). Joint multitask learning with
shared layers shows a performance boost for both tasks, which has not been
seen in the previous experiments.

• Experiment 8 (Table 10.11), where we combine all tasks with
bidirectional LSTM and GloVe, shows different performance compared to
experiment 3 (Table 10.6). POS tagging and chunking show improvements
with shared layers, but NER shows a deterioration in performance. Except
for chunking, all others show worse performance with separate layers as
compared to experiment 3.


10.5.4 Exercises for Readers and Practitioners


Some of the extensions and extra ideas for researchers to try are given below: 


1. What is the impact of using different pre-trained embeddings such as word2vec?

2. What is the impact of adding more layers to the RNN for both shared and
separate configurations? Does that change the MTL behavior?

3. We tried MTL with LSTM but not with GRU or even a base RNN. Is there a
significant difference in the performance of MTL with the choice of recurrent
networks?

4. What is the impact of hyperparameters like the number of hidden units, batch
size, and epochs on MTL?

5. If we add more tasks, such as language modeling, sentiment classification, and
semantic role labeling, to name a few, what would be the performance
impact on MTL?

6. Use the same dataset with other research like cross-stitch networks, sluice
networks, and others to get a comparative analysis across the methods.


Chapter 11
Transfer Learning: Domain Adaptation


11.1 Introduction 


Domain adaptation is a form of transfer learning, in which the task remains the 
same, but there is a domain shift or a distribution change between the source and 
the target. As an example, consider a model that has learned to classify reviews on 
electronic products for positive and negative sentiments, and is used for classifying 
the reviews for hotel rooms or movies. The task of sentiment analysis remains the 
same, but the domain (electronics and hotel rooms) has changed. The application 
of the model to a separate domain poses many problems because of the change 
between the training data and the unseen testing data, typically known as domain 
shift. For example, sentences containing phrases such as “loud and clear” will be 
mostly considered positive in electronics, whereas negative in hotel room reviews. 
Similarly, usage of keywords such as “lengthy” or “boring” which may be prevalent 
in domains such as book reviews might be completely absent in domains such as 
kitchen equipment reviews. 

As discussed in the last chapter, the central idea behind domain adaptation is to 
learn from the source dataset (labeled and unlabeled) so that the learning can be used 
on a target dataset with a different domain mapping. To learn the domain shift be- 
tween the source and the target, traditional techniques that are employed fall under 
two broad categories: instance-based and feature-based. In instance-based methods,
the discrepancy between the source and the target domain is reduced by reweighting
source samples and learning models from the reweighted ones [BM10]. In feature-based
methods, a
common shared space or a joint representation is learned between the source and 
the target where the distributions match [GGS13]. In recent times, deep learning 
architectures have been successfully implemented for domain adaptation in various 
applications especially in the field of computer vision [Csu17]. In this chapter, we 
discuss at length some of the techniques using deep learning in domain adaptation 
and their applications in text and speech. Next, we discuss techniques in zero-shot,
one-shot, and few-shot learning that have gained popularity in the domain adaptation
field. At the end, we perform a detailed case study using many of the techniques
discussed in the chapter to give readers a practical view of domain adaptation.


In this chapter, we will use notations similar to those of the research papers
we cite, for easy mapping to the references.


11.1.1 Techniques 


In this section, we will highlight some of the well-known techniques that are
effective and generic enough for solving domain adaptation problems in text
and speech.


11.1.1.1 Stacked Autoencoders 


One of the earliest works in domain adaptation comes from Glorot et al. in the area
of sentiment classification [GBB11b]. The source domain contains a large number
of Amazon reviews with sentiment labels, while the target consists of completely
different products with little labeled data. As the first step, the researchers
use stacked denoising autoencoders (SDA) on the source and target data combined to
learn the features, as shown in Fig. 11.1. Then, a linear SVM is trained on the
features extracted by the encoder part of the autoencoder and is used to predict
unseen target data of different domains. The researchers report state-of-the-art
results for sentiment classification across the domains.

Variants, such as stacked marginalized denoising autoencoders (mSDA),
which have better optimal solutions and faster training times, have also been
employed very successfully in classification tasks, such as sentiment
classification [Che+12]. To explain the method, let us assume that for source $S$
and target $T$, we have sample source data $D_S = \{x_1, \ldots, x_{n_S}\} \subset \mathbb{R}^d$
with labels $L_S = \{y_1, \ldots, y_{n_S}\}$, and target sample data
$D_T = \{x_{n_S+1}, \ldots, x_n\} \subset \mathbb{R}^d$ with no labels. The goal is
to learn a classifier $h \in H$ with the labeled source training data $D_S$ to predict
on the unlabeled target data $D_T$.

The basic building block in this work is the one-layered denoising autoencoder.
The input for this is the entire set of the source and the target data, i.e.,
$D = D_S \cup D_T = \{x_1, \ldots, x_n\}$, and it is corrupted by removing each feature
with probability $p \geq 0$. For example, if the representation is a bag-of-words
vector, some values can be flipped from 1 to 0. Let $\tilde{x}_i$ be the corrupted
version of $x_i$. Instead of using the two-level encoder-decoder, a single mapping
$W : \mathbb{R}^d \rightarrow \mathbb{R}^d$ is used that minimizes the squared
reconstruction loss given by:


$$\frac{1}{2n} \sum_{i=1}^{n} \| x_i - W \tilde{x}_i \|^2 \tag{11.1}$$


Fig. 11.1: Stacked denoising autoencoders for learning features and SVM as a
classifier


If we repeat this $m$ times, the variance gets lowered and the solution for $W$ can
be obtained from:


$$\mathcal{L}_{squared}(W) = \frac{1}{2mn} \sum_{j=1}^{m} \sum_{i=1}^{n} \| x_i - W \tilde{x}_{i,j} \|^2 \tag{11.2}$$

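where $\tilde{x}_{i,j}$ represents the $j$th corrupted version of the original input $x_i$.
In matrix notation, with inputs $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, its
$m$-times repeated version $\bar{X}$, and the corrupted version of $\bar{X}$ denoted
$\tilde{X}$, the loss equation can be written as:


$$\mathcal{L}_{squared}(W) = \frac{1}{2mn} \mathrm{tr}\left[ (\bar{X} - W\tilde{X})^{\top} (\bar{X} - W\tilde{X}) \right] \tag{11.3}$$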


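The solution to this in closed form is:

$$W = P Q^{-1} \quad \text{with} \quad Q = \tilde{X}\tilde{X}^{\top} \quad \text{and} \quad P = \bar{X}\tilde{X}^{\top} \tag{11.4}$$


In the limiting case of $m \rightarrow \infty$, $W$ can be expressed in terms of the
expectations of $P$ and $Q$:

$$W = E[P] \, E[Q]^{-1} \tag{11.5}$$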


Let us consider $E[Q]$, which is


$$E[Q] = \sum_{i=1}^{n} E\left[ \tilde{x}_i \tilde{x}_i^{\top} \right] \tag{11.6}$$


The off-diagonal entries in the matrix $\tilde{x}_i \tilde{x}_i^{\top}$ are uncorrupted if
two features $\alpha$ and $\beta$ both survive the corruption. This has a probability of
$(1 - p)^2$. For the diagonal, it holds with probability $1 - p$. If we define a vector
$q = [1 - p, \ldots, 1 - p, 1] \in \mathbb{R}^{d+1}$, where $q_{\alpha}$ represents the
probability of the feature $\alpha$ surviving the corruption, then the scatter matrix of
the original uncorrupted input can be represented as $S = XX^{\top}$ and the expectation
of matrix $Q$ can be written as:


$$E[Q]_{\alpha,\beta} = \begin{cases} S_{\alpha\beta} \, q_{\alpha} q_{\beta} & \text{if } \alpha \neq \beta \\ S_{\alpha\beta} \, q_{\alpha} & \text{if } \alpha = \beta \end{cases} \tag{11.7}$$


In a similar way, the expectation of matrix $P$ can be derived as
$E[P]_{\alpha,\beta} = S_{\alpha\beta} \, q_{\beta}$.

Thus, with these expectation matrices, the reconstructive mapping $W$ can be
computed in closed form without corrupting a single instance $x_i$, thereby
“marginalizing” the noise. Next, instead of just a single layer, the research
“stacks” the layers one after another, similar to the stacked autoencoders. The
output of the $(t-1)$th layer feeds into the $t$th layer after a squashing function
such as tanh to give a non-linearity and thus can be expressed as
$h^t = \tanh(W^t h^{t-1})$. The training is performed layer by layer, i.e., each
layer greedily learns $W^t$ (in closed form) and tries to reconstruct the previous
output $h^{t-1}$. For domain adaptation, they use the inputs and all the hidden
layers concatenated as features for an SVM classifier to train and predict.
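The closed-form computation and the stacking described above can be sketched with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: a constant bias feature is appended to the input (which is why $q$ has $d+1$ entries), a small ridge term `reg` is added for numerical stability, and the function names are hypothetical.

```python
import numpy as np


def marginalized_denoising_layer(X, p, reg=1e-5):
    """One mSDA layer. X has shape (d, n); p is the feature corruption probability."""
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])          # append a never-corrupted bias feature
    S = Xb @ Xb.T                                 # scatter matrix of the clean input
    q = np.append(np.full(d, 1.0 - p), 1.0)       # survival probability of each feature
    Q = S * np.outer(q, q)                        # off-diagonal: S_ab * q_a * q_b
    np.fill_diagonal(Q, q * np.diag(S))           # diagonal: S_aa * q_a
    P = S[:d, :] * q                              # E[P]_ab = S_ab * q_b
    # W = E[P] E[Q]^{-1}; Q is symmetric, so solve Q W^T = P^T instead of inverting.
    W = np.linalg.solve(Q + reg * np.eye(d + 1), P.T).T
    return W, np.tanh(W @ Xb)


def msda_features(X, p, num_layers):
    """Stack layers greedily and concatenate the input with all hidden outputs."""
    features, h = [X], X
    for _ in range(num_layers):
        _, h = marginalized_denoising_layer(h, p)
        features.append(h)
    return np.vstack(features)


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                      # toy data: 3 features, 50 samples
W, hidden = marginalized_denoising_layer(X, p=0.0)
stacked = msda_features(X, p=0.3, num_layers=2)
```

With $p = 0$ there is no corruption, so the learned mapping simply reproduces the input, which is a convenient sanity check. The concatenated rows of `stacked` play the role of the features that would be passed to the SVM.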


Some of the advantages of mSDA as compared to others are: 


1. The optimization problem is convex and guarantees an optimal solution.

2. The optimization is non-iterative and closed-form.

3. A single pass through the entire training data to compute the expectations
$E[P]$ and $E[Q]$ gives a huge training speed boost.


11.1.1.2 Deep Interpolation Between Source and Target 


Much as in traditional machine learning approaches, research by Chopra et al.
mixes source and target data from different domains in different proportions to
learn intermediate representations [CBG13]. This work is known as deep learning for
domain adaptation by interpolating between domains (DLID). The researchers use
convolutional layers with pooling and the predictive sparse decomposition method
to learn non-linear features in an unsupervised way. The predictive sparse
decomposition method is similar to sparse coding models but with a fast and smooth
approximator [KRL10]. The labeled data is passed through the same transformations
to get features, which are concatenated and used with a classifier such as
logistic regression to get a joint model. In this way, the model learns useful
features in an unsupervised manner from both source and target. These features can
be employed for domain transfer on the target alone. Figure 11.2a, b shows
schematically how this process works.


Fig. 11.2: DLID method. (a) Unsupervised latent representation learning. The top
circles show the intermediate path between the source and the target, where the filled
circles are the intermediate representations, and the empty circles are the
source/target representations. (b) Unsupervised features and labels with classifier
for learning model


Let $S$ be the source domain with data samples $D_S$, $T$ be the target domain with
data samples $D_T$, and $p \in [1, 2, \ldots, P]$ be the index over $P$ datasets. The mix
between source and target is done in such a way that at $p = 1$ the dataset is the
source, $D_1 = D_S$, and from then onwards the number of source samples decreases
and that of target samples increases in exactly the same proportion. For each
intermediate dataset $p \in [2, \ldots, P-1]$, the number of samples from $D_S$ goes
down and the number from $D_T$ goes up incrementally with each successive $p$. Each
dataset $D_p$ is the input to a non-linear feature extractor $F_{W_p}$ with weights
$W_p$, trained in an unsupervised manner, which generates the output
$Z_p^i = F_{W_p}(X^i)$. Once the extractors are trained, any labeled training example
goes through this DLID representation path, each $F_{W_p}$ extracts features from it,
and the concatenation of all the outputs forms the representation for that input:


$$Z^i = [F_{W_1}(X^i), F_{W_2}(X^i), \ldots, F_{W_P}(X^i)] = [Z_1^i, Z_2^i, \ldots, Z_P^i] \tag{11.8}$$


The representation and the label, $(Z^i, Y^i)$, are passed to the classifier or
regressor for the task, which is optimized with a standard loss function. Unseen
data goes through the same path, and the predictions from the classifier give the
class and its probability.
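The interpolation path and the concatenated representation can be illustrated with a short sketch. It assumes a linear mixing schedule in which the first dataset is pure source and the last is pure target, and it uses toy feature extractors in place of the trained unsupervised models; all function names are hypothetical.

```python
import random


def interpolated_datasets(source, target, num_datasets, size, seed=0):
    """Build datasets that move from all-source (p = 1) to all-target (p = P)."""
    rng = random.Random(seed)
    datasets = []
    for p in range(1, num_datasets + 1):
        target_fraction = (p - 1) / (num_datasets - 1)   # assumed linear schedule
        n_target = round(size * target_fraction)
        n_source = size - n_target
        mixed = rng.sample(source, n_source) + rng.sample(target, n_target)
        rng.shuffle(mixed)
        datasets.append(mixed)
    return datasets


def dlid_representation(extractors, x):
    """Concatenate the outputs of all per-dataset feature extractors for one input."""
    return [value for extractor in extractors for value in extractor(x)]


source = [("source", i) for i in range(10)]
target = [("target", i) for i in range(10)]
path = interpolated_datasets(source, target, num_datasets=5, size=8)
target_counts = [sum(1 for domain, _ in d if domain == "target") for d in path]

# Toy stand-ins for the trained extractors F_{W_p}.
extractors = [lambda x: [sum(x)], lambda x: [max(x), min(x)]]
representation = dlid_representation(extractors, [1, 2, 3])
```

In the real method each dataset on the path trains its own extractor, so the concatenated representation captures features ranging from source-like to target-like.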


11.1.1.3 Deep Domain Confusion 


The deep domain confusion (DDC) architecture, shown in Fig. 11.3, is proposed by 
Tzeng et al. and is one of the popular discrepancy-based domain adaptation frame- 
works [Tze+14]. The researchers introduce a domain adaptation layer and a confu- 
sion loss to learn a representation that is semantically meaningful and provides do- 
main invariance. A Siamese convolutional network-based architecture is proposed, 
where the main goal is to learn a representation that minimizes the distribution dis- 
tance between the source and the target domain. The representation can be used 
as features along with the source-labeled dataset to minimize the classification loss 
and is applied directly to unlabeled target data. In this work, the task of minimizing
the distribution distance is done using maximum mean discrepancy (MMD), which
computes the distance on a representation $\phi(\cdot)$ for both source and target as:


$$\mathrm{MMD}(X_S, X_T) = \left\| \frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s) - \frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t) \right\| \tag{11.9}$$

The representation learned from this is used in the loss function as a regularizer,
with the regularization hyperparameter $\lambda$ also controlling the amount of
confusion between the source and the target domain:


$$L = L_C(X_L, y) + \lambda \, \mathrm{MMD}(X_S, X_T) \tag{11.10}$$


where $L_C(X_L, y)$ is the classification loss on the labeled data $X_L$, $y$ is the
label or the ground truth, and $\mathrm{MMD}(X_S, X_T)$ is the maximum mean discrepancy
between the source $X_S$ and the target $X_T$. The researchers use the standard
AlexNet and modify it to have an additional lower-dimensional bottleneck
layer “fc adapt.” The lower-dimensional layer acts as a regularizer and prevents
overfitting on the source distribution. The MMD loss discussed above is added on
top of this layer so that it learns a representation useful for both the source
and the target.


Fig. 11.3: Deep domain confusion network (DDCN) for domain adaptation


11.1.1.4 Deep Adaptation Network 


Long et al. propose the deep adaptation network (DAN) as shown in Fig. 11.4, which
is a modified AlexNet, where the discrepancy loss happens at the last fully
connected layers [LW15]. If the null hypothesis is that the samples are drawn from
the same distribution, and the alternate hypothesis is that they come from two
different distributions, maximum mean discrepancy (MMD) is one of the statistical
approaches for testing it [Sej+12]. The multiple kernel variant of MMD (MK-MMD)
measures the reproducing kernel Hilbert space (RKHS) distance between the mean
embeddings of two distributions (source and target) with a characteristic kernel
$k$. If $\mathcal{H}_k$ is the reproducing kernel Hilbert space endowed with a
characteristic kernel $k$, the mean embedding of distribution $p$ in $\mathcal{H}_k$
is a unique element $\mu_k(p)$, such that
$E_{x \sim p} f(x) = \langle f(x), \mu_k(p) \rangle_{\mathcal{H}_k}$. The squared
distance for any layer $l$ and kernel $k$ between source ($S$) and target ($T$) is
given by:


$$d_k^2(D_S, D_T) = \left\| E_{D_S}[\phi(x^S)] - E_{D_T}[\phi(x^T)] \right\|_{\mathcal{H}_k}^2 \tag{11.11}$$
The characteristic kernel associated with the feature map @, k(x°,x™) = 


(@(x°),@(xT)) and is a combination of m positive semi-definite kernels {k, } with 
constraints on the coefficients B, as given by: 


KS {k= > Bk : ¥ B.=1.B.>0} (11.12) 
u=| u=1 


where the derived multi-kernel k is characteristic because of the constraints on co- 
efficients {B,, }. 
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The modified AlexNet has three layers of convolution network (conv1 — conv3) as 
the general transferable feature layers that are frozen after training on one domain. 
The next two convolution layers (conv4 — conv5) are more specific, and hence fine- 
tuning is done for learning domain-specific features. The final fully connected layers 
(fc6 — fc8) are highly specific and non-transferable, so they get adapted using the 
MK-MMD. If all the parameters in the network are given by 0 = {W’,b'}!_, for 
all the layers /, the empirical risk is given by: 


$$\min_{\Theta} \frac{1}{n_a} \sum_{i=1}^{n_a} J(\theta(x_i^a), y_i^a) \tag{11.13}$$


where $J$ is the cross-entropy loss function, $n_a$ is the number of labeled examples, and
$\theta(x_i^a)$ is the conditional probability of assigning the data point $x_i^a$ the label $y_i^a$.
By adding the MK-MMD-based multi-layer adaptation regularizer to the above risk, we
get a loss similar to the DDC loss that can be expressed as:

$$\min_{\Theta} \frac{1}{n_a} \sum_{i=1}^{n_a} J(\theta(x_i^a), y_i^a) + \lambda \sum_{l=l_1}^{l_2} d_k^2(\mathcal{D}_S^l, \mathcal{D}_T^l) \tag{11.14}$$

where $\lambda > 0$ is a regularization constant and $l_1 = 6$ and $l_2 = 8$ for the DAN setup.
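
As an illustration of the multi-kernel distance in Eqs. 11.11 and 11.12, the sketch below estimates the squared MK-MMD between two sample sets with a convex combination of Gaussian kernels. It is a sketch under stated assumptions, not the DAN code: the choice of Gaussian kernels, the bandwidths, the coefficients, and the use of the simple biased estimator (kernel means over all pairs) are all choices made for this example.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mk_mmd2(xs, xt, bandwidths, betas):
    """Biased estimate of the squared MK-MMD with kernel k = sum_u beta_u k_u."""
    betas = np.asarray(betas, dtype=float)
    # Constraints of Eq. 11.12: non-negative coefficients that sum to one.
    if np.any(betas < 0) or not np.isclose(betas.sum(), 1.0):
        raise ValueError("betas must be non-negative and sum to 1")

    def k(a, b):
        return sum(beta * gaussian_kernel(a, b, bw)
                   for beta, bw in zip(betas, bandwidths))

    return float(k(xs, xs).mean() + k(xt, xt).mean() - 2.0 * k(xs, xt).mean())


rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(100, 4))  # hypothetical source activations
xt = rng.normal(1.0, 1.0, size=(100, 4))  # hypothetical shifted target activations
bandwidths, betas = (0.5, 1.0, 2.0), (0.5, 0.3, 0.2)

print("same samples:", mk_mmd2(xs, xs, bandwidths, betas))
print("shifted samples:", mk_mmd2(xs, xt, bandwidths, betas))
```

In a DAN-style setup this quantity would be computed on the activations of each adapted layer and summed with weight $\lambda$, as in Eq. 11.14.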
Fig. 11.4: Deep adaptation network (DAN) for domain adaptation


11.1.1.5 Domain-Invariant Representation 


Many techniques employ domain-invariant representation using the source and the
target data as a way to learn a common representation that can help in domain
adaptation.

CORrelation ALignment (CORAL) is a technique to align the second-order
statistics (covariances) of the source and target using a linear transformation. Sun
and Saenko extend the framework to learn a non-linear transformation that aligns
the correlations of the layer activations, known as Deep CORAL [SS16]. Deep CORAL
extends AlexNet and has the second-order statistics loss computed at the last layer,
i.e., a fully connected layer before the output. If $D_S = \{x_i\}, x \in \mathbb{R}^d$ are the source
domain training data of size $n_S$, and $D_T = \{u_i\}, u \in \mathbb{R}^d$ are the unlabeled target data
of size $n_T$, $D_S^{ij}$ indicates the $j$th dimension of the $i$th source data instance, $D_T^{ij}$ indicates
the $j$th dimension of the $i$th target data instance, $C_S$ is the source feature covariance
matrix, and $C_T$ is the target feature covariance matrix, then the CORAL loss is measured
as the distance between the covariances:

$$\ell_{CORAL} = \frac{1}{4d^2} \left\| C_S - C_T \right\|_F^2 \tag{11.15}$$


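where $\|\cdot\|_F^2$ represents the squared matrix Frobenius norm.
The covariance matrices for the source and the target are given by:

$$C_S = \frac{1}{n_S - 1} \left( D_S^\top D_S - \frac{1}{n_S} (\mathbf{1}^\top D_S)^\top (\mathbf{1}^\top D_S) \right) \tag{11.16}$$

$$C_T = \frac{1}{n_T - 1} \left( D_T^\top D_T - \frac{1}{n_T} (\mathbf{1}^\top D_T)^\top (\mathbf{1}^\top D_T) \right) \tag{11.17}$$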


where $\mathbf{1}$ is a column vector with all elements equal to 1. The joint training that
reduces the classification loss $\ell_{class}$ and the CORAL loss is given by:

$$\ell = \ell_{class} + \sum_{i=1}^{t} \lambda_i \ell_{CORAL} \tag{11.18}$$

where $t$ is the number of layers with a CORAL loss, and $\lambda$ is used to balance between
classification and domain adaptation, aiming at learning a representation common
between the source and the target (Fig. 11.5).
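
The covariance and loss formulas above translate almost directly into NumPy. The following is a small sketch with hypothetical random activations; the $1/(4d^2)$ scaling follows the Deep CORAL formulation, and the function names are made up for the example.

```python
import numpy as np


def covariance(data):
    """Eqs. 11.16 and 11.17: feature covariance of a data matrix (rows = samples)."""
    n = data.shape[0]
    column_sums = np.ones((1, n)) @ data  # 1^T D, shape (1, d)
    return (data.T @ data - (column_sums.T @ column_sums) / n) / (n - 1)


def coral_loss(source, target):
    """Eq. 11.15: squared Frobenius distance between covariances, scaled by 1/(4 d^2)."""
    d = source.shape[1]
    diff = covariance(source) - covariance(target)
    return float(np.sum(diff ** 2) / (4.0 * d ** 2))


rng = np.random.default_rng(0)
source = rng.normal(size=(64, 5))        # hypothetical source activations
target = 2.0 * rng.normal(size=(48, 5))  # hypothetical target activations, larger spread

print("CORAL(source, source):", coral_loss(source, source))
print("CORAL(source, target):", coral_loss(source, target))
```

The hand-written covariance agrees with `np.cov(data, rowvar=False)`, which is a convenient check when wiring the loss into a larger model.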

There are other domain-invariant representations that have been successfully em- 
ployed in various works. Pan et al. use domain-invariant representation via transfer 
component analysis that uses maximum-mean discrepancies (MMD) and tries to 
reduce the distance between the two domains in the subspace [Pan+11]. Zellinger 
et al. propose a new distance function—central moment discrepancy (CMD)—to 
match the higher-order central moments of probability distributions [Zel+17]. They
show the generality of their techniques in domain adaptation across object recogni- 
tion and sentiment classification tasks (Fig. 11.5). 


11.1.1.6 Domain Confusion and Invariant Representation 


Fig. 11.5: Deep CORAL network for domain adaptation

The research by Tzeng et al. on deep domain confusion has the disadvantage that
it needs both large labeled data in the source domain and sparsely labeled data in
the target domain. Tzeng et al. in their work propose a domain confusion loss over
both labeled and unlabeled data to learn the invariant representation across domains
and tasks [Tze+15]. The transfer learning between the source and the target domain is
achieved by a) maximizing the domain confusion by making the marginal distributions
of the source and target as similar to each other as possible; and b) transferring the
correlation between classes learned on source examples to target examples. The
completely labeled source data $(x_S, y_S)$ and the sparsely labeled target data $(x_T, y_T)$
are used to produce a classifier $\theta_C$ that operates on the feature representation
$f(x; \theta_{repr})$ parameterized by the representation parameters $\theta_{repr}$ and has good
accuracy in classifying the target samples:


$$L_C(x, y; \theta_{repr}, \theta_C) = -\sum_{k} \mathbb{1}[y = k] \log p_k \tag{11.19}$$

where $p = \mathrm{softmax}(\theta_C^\top f(x; \theta_{repr}))$.

To ensure that there is alignment between the classes in source and target instead 
of having “hard labels” to train on, the “soft label” is averaged over the softmax 
of all activations of labeled source data for a particular class. A high temperature 
parameter T is used in the softmax function, so that related classes have similar 
effects on the probability mass during fine-tuning. The soft label loss is given by: 


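$$L_{soft}(x_T, y_T; \theta_{repr}, \theta_C) = -\sum_{i} l_i^{(y_T)} \log p_i \tag{11.20}$$

where $l^{(y_T)}$ is the soft label for class $y_T$ and $p = \mathrm{softmax}(\theta_C^\top f(x_T; \theta_{repr}) / T)$,
with $T$ the temperature.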

A domain classifier layer with parameters $\theta_D$ is used to identify whether the
data comes from the source or the target domain. The best domain classifier on the
representation can be learned using the objective:

$$L_D(x_S, x_T, \theta_{repr}; \theta_D) = -\sum_{d} \mathbb{1}[y_D = d] \log q_d \tag{11.21}$$

where $q = \mathrm{softmax}(\theta_D^\top f(x; \theta_{repr}))$.

Thus, for a particular domain classifier $\theta_D$, the loss that maximizes the confusion
can be seen as a cross-entropy loss between the prediction of the domain and the
uniform distribution over the domain labels and can be written as:

$$L_{conf}(x_S, x_T, \theta_D; \theta_{repr}) = -\sum_{d} \frac{1}{D} \log q_d \tag{11.22}$$

where $D$ is the number of domains. The parameters $\theta_D$ and $\theta_{repr}$ are learned
iteratively by the following objectives:

$$\min_{\theta_D} L_D(x_S, x_T, \theta_{repr}; \theta_D) \tag{11.23}$$

$$\min_{\theta_{repr}} L_{conf}(x_S, x_T, \theta_D; \theta_{repr}) \tag{11.24}$$


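Thus, the joint loss function can be written as:

$$\begin{aligned} L(x_S, y_S, x_T, y_T; \theta_{repr}, \theta_C) = {} & L_C(x_S, y_S, x_T, y_T; \theta_{repr}, \theta_C) \\ & + \lambda L_{conf}(x_S, x_T, \theta_D; \theta_{repr}) \\ & + \nu L_{soft}(x_T, y_T; \theta_{repr}, \theta_C) \end{aligned} \tag{11.25}$$

where $\lambda$ and $\nu$ are the hyperparameters that control the domain confusion and the
soft label influence during the optimization.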
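
The three ingredients of this section (the domain classifier loss, the confusion loss against a uniform target, and the temperature-softened soft label loss) can be sketched as follows. This is an illustrative sketch only: the inputs are raw logits supplied by hand, averaging over the batch is an assumption of the example, and all function names are hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax with an optional temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def domain_classifier_loss(domain_logits, domain_labels):
    """Eq. 11.21 averaged over a batch: cross-entropy against the true domain."""
    q = softmax(domain_logits)
    rows = np.arange(len(domain_labels))
    return float(-np.mean(np.log(q[rows, domain_labels])))


def confusion_loss(domain_logits):
    """Eq. 11.22 averaged over a batch: cross-entropy against the uniform distribution."""
    q = softmax(domain_logits)
    return float(-np.mean(np.mean(np.log(q), axis=1)))


def soft_label_loss(class_logits, soft_label, temperature=2.0):
    """Eq. 11.20 averaged over a batch, using a temperature-scaled softmax."""
    p = softmax(class_logits, temperature)
    return float(-np.mean(np.sum(soft_label * np.log(p), axis=-1)))


undecided = np.zeros((4, 2))                  # classifier cannot tell the domains apart
confident = np.array([[5.0, -5.0]] * 4)       # classifier is sure about the domain
labels = np.array([0, 0, 0, 0])

print("confusion loss (undecided):", confusion_loss(undecided))
print("confusion loss (confident):", confusion_loss(confident))
print("classifier loss (confident):", domain_classifier_loss(confident, labels))
```

The confusion loss is smallest when the domain classifier outputs a uniform distribution, which is exactly the state the representation parameters are trained toward in Eq. 11.24.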


11.1.1.7 Domain-Adversarial Neural Network 


Ganin et al. employ an interesting technique of “gradient reversal” layer for domain 
shift adaptation through domain-adversarial neural network (DANN) [Gan+16b]. 
The process is generic to all neural networks and can be easily trained using standard 
stochastic gradient methods. They show state-of-the-art results in different domains 
of computer vision and sentiment classification. 

Let $S = \{(x_i, y_i)\}_{i=1}^{n} \sim (\mathcal{D}_S)^n$ and $T = \{x_i\}_{i=n+1}^{N} \sim (\mathcal{D}_T^X)^{n'}$ be the source and
target data drawn from the distributions $\mathcal{D}_S$ and $\mathcal{D}_T^X$; $N = n + n'$ is the total number
of samples. $\mathcal{D}_T^X$ is the marginal distribution of $\mathcal{D}_T$ over the input space $X$, and
$Y = \{0, 1, \ldots, L-1\}$ is the set of labels. The network has three important layers: (a)
the feature generation layer, which learns features from the inputs. This
hidden layer $G_f : X \to \mathbb{R}^D$ is parameterized by the matrix-vector pair $\theta_f = (W, b)$:

$$G_f(x; \theta_f) = \sigma(Wx + b) \tag{11.26}$$

(b) the label prediction layer $G_y : \mathbb{R}^D \to [0, 1]^L$, parameterized by the matrix-vector pair
$\theta_y = (V, c)$:

$$G_y(G_f(x); \theta_y) = \mathrm{softmax}(V G_f(x) + c) \tag{11.27}$$

and (c) the domain classification layer $G_d : \mathbb{R}^D \to [0, 1]$, a logistic regressor
parameterized by the vector-scalar pair $\theta_d = (u, z)$ that predicts whether the example is
from the source or the target domain. Figure 11.6 shows the training across the three
different layers.

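The prediction loss for $(x_i, y_i)$ can be written as:

$$L_y^i(\theta_f, \theta_y) = L_y(G_y(G_f(x_i; \theta_f); \theta_y), y_i) \tag{11.28}$$

The domain loss for $(x_i, d_i)$, where $d_i$ is the domain label, can be written as:

$$L_d^i(\theta_f, \theta_d) = L_d(G_d(G_f(x_i; \theta_f); \theta_d), d_i) \tag{11.29}$$

The total training loss for a single layer network can be written as:

$$L_{total}(\theta_f, \theta_y, \theta_d) = \frac{1}{n} \sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) - \lambda \left( \frac{1}{n} \sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{n'} \sum_{i=n+1}^{N} L_d^i(\theta_f, \theta_d) \right) \tag{11.30}$$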
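The hyperparameter $\lambda$ controls the trade-off between the losses. The parameters
are obtained by solving the equations:

$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} L_{total}(\theta_f, \theta_y, \hat{\theta}_d) \tag{11.31}$$

$$\hat{\theta}_d = \arg\max_{\theta_d} L_{total}(\hat{\theta}_f, \hat{\theta}_y, \theta_d) \tag{11.32}$$

The gradient updates are very similar to standard stochastic gradient descent with
a learning rate $\mu$, except for the reversal term $-\lambda \, \partial L_d^i / \partial \theta_f$. The gradient reversal
layer has no parameters; its forward pass is the identity function, while its backward
pass is the gradient from the subsequent layer multiplied by $-1$:

$$\theta_f \leftarrow \theta_f - \mu \left( \frac{\partial L_y^i}{\partial \theta_f} - \lambda \frac{\partial L_d^i}{\partial \theta_f} \right) \tag{11.33}$$

$$\theta_y \leftarrow \theta_y - \mu \frac{\partial L_y^i}{\partial \theta_y} \tag{11.34}$$

$$\theta_d \leftarrow \theta_d - \mu \lambda \frac{\partial L_d^i}{\partial \theta_d} \tag{11.35}$$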
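
To make the role of the gradient reversal layer concrete, the sketch below implements it by hand in NumPy and applies one descent step to a feature vector. Everything here is a simplification made for the example: the feature vector `h` is updated directly instead of the weights that produce it, the domain classifier is a fixed logistic regressor, and the scaling of the reversed gradient by `lam` is an assumption that mirrors the $\lambda$ term of the feature update.

```python
import numpy as np


class GradientReversal:
    """Identity in the forward pass; flips (and optionally scales) the gradient backward."""

    def __init__(self, lam=1.0):
        self.lam = lam  # lam = 1.0 gives the plain multiplication by -1

    def forward(self, h):
        return h

    def backward(self, grad_output):
        return -self.lam * grad_output


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def domain_loss(h, u, z, d):
    """Binary cross-entropy of a logistic domain classifier sigmoid(u.h + z)."""
    d_hat = sigmoid(u @ h + z)
    return float(-(d * np.log(d_hat) + (1.0 - d) * np.log(1.0 - d_hat)))


def domain_loss_grad_wrt_features(h, u, z, d):
    """Analytic gradient of the domain loss with respect to the features h."""
    return (sigmoid(u @ h + z) - d) * u


# Hypothetical features and domain-classifier parameters.
h = np.array([0.2, -0.4, 0.7])
u = np.array([1.0, -2.0, 0.5])
z, d, mu = 0.1, 1.0, 0.1

grl = GradientReversal(lam=0.5)
grad_seen_by_features = grl.backward(domain_loss_grad_wrt_features(grl.forward(h), u, z, d))
h_after = h - mu * grad_seen_by_features  # ordinary descent step on the reversed gradient

loss_before = domain_loss(h, u, z, d)
loss_after = domain_loss(h_after, u, z, d)
print("domain loss before:", loss_before, "after:", loss_after)
```

Because the gradient is reversed before it reaches the features, an ordinary descent step makes the domain classifier's job harder, while the classifier's own parameters would still be updated to reduce the same loss.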


11.1.1.8 Adversarial Discriminative Domain Adaptation 


Fig. 11.6: Domain-adversarial neural network

Tzeng et al. propose adversarial discriminative domain adaptation (ADDA),
which uses a discriminative approach for learning the domain shifts, has no weights
tied between the source and the target, and uses a GAN loss as the adversarial
loss [Tze+17].

Let us assume we have source data $X_s$ and labels $Y_s$ drawn from a source distribution
$p_s(x, y)$, and target data $X_t$ from a target distribution $p_t(x, y)$ with no labels. The
goal is to learn a target mapping $M_t$ and a classifier $C_t$ that can classify $K$ categories.
In adversarial methods, the goal is to minimize the distribution distance between
the source and target mappings $M_s(X_s)$ and $M_t(X_t)$, so that the source classification
model $C_s$ can be used directly on the target, i.e., $C = C_s = C_t$. The standard
supervised loss can be written as:

$$\min_{M_s, C} L_{class}(X_s, Y_s) = -\mathbb{E}_{(x_s, y_s) \sim (X_s, Y_s)} \sum_{k=1}^{K} \mathbb{1}[k = y_s] \log C(M_s(x_s)) \tag{11.36}$$

The domain discriminator $D$ classifies whether the data is from the source or the
target, and $D$ is optimized using $L_{adv_D}$:

$$\begin{aligned} \min_{D} L_{adv_D}(X_s, X_t, M_s, M_t) = {} & -\mathbb{E}_{x_s \sim X_s}[\log D(M_s(x_s))] \\ & -\mathbb{E}_{x_t \sim X_t}[\log(1 - D(M_t(x_t)))] \end{aligned} \tag{11.37}$$

The adversarial mapping loss is given by $L_{adv_M}$:

$$\min_{M_s, M_t} L_{adv_M}(X_s, X_t, D) = -\mathbb{E}_{x_t \sim X_t}[\log D(M_t(x_t))] \tag{11.38}$$

The training happens in phases, as shown in Fig. 11.7. The process begins with
optimizing $L_{class}$ over $M_s$ and $C$ using the labeled data $X_s$ and labels $Y_s$. We can then
perform adversarial adaptation by optimizing $L_{adv_D}$ and $L_{adv_M}$.
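
The two adversarial losses used in the second phase can be written compactly once the discriminator outputs are available. The sketch below assumes the discriminator returns the probability that a feature vector comes from the source domain; the function names and the constant example outputs are hypothetical.

```python
import numpy as np


def discriminator_loss(d_on_source, d_on_target):
    """Discriminator loss: D should output 1 on source features and 0 on target features."""
    return float(-np.mean(np.log(d_on_source)) - np.mean(np.log(1.0 - d_on_target)))


def mapping_loss(d_on_target):
    """Mapping loss with inverted labels: the target mapping wants D to output 1."""
    return float(-np.mean(np.log(d_on_target)))


# Hypothetical discriminator outputs D(M_s(x_s)) and D(M_t(x_t)).
sharp_source, sharp_target = np.full(8, 0.9), np.full(8, 0.1)
fooled_source, fooled_target = np.full(8, 0.5), np.full(8, 0.5)

print("D loss, domains separable:", discriminator_loss(sharp_source, sharp_target))
print("D loss, domains confused: ", discriminator_loss(fooled_source, fooled_target))
print("M loss, target recognized:", mapping_loss(sharp_target))
print("M loss, target confused:  ", mapping_loss(fooled_target))
```

The two losses pull in opposite directions on the target features: the discriminator loss is low when the domains are separable, while the mapping loss is low when target features look like source features to the discriminator.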
Fig. 11.7: Adversarial discriminative domain adaptation


11.1.1.9 Coupled Generative Adversarial Networks 


Liu and Tuzel propose a coupled generative adversarial network (CoGAN) for learning
a joint distribution between two domains and show it to be very successful in
computer vision [LQH16]. As discussed in Chap. 4, GANs consist of generative
and discriminative models. The generative model is used to generate synthetic data
resembling real data, while the discriminative model is used to distinguish between
the two. Formally, a random vector $z$ is input to the generative model, which outputs
$g(z)$ with the same support as the input $x$. The discriminative model outputs
$f(x) = 1$ if $x$ is drawn from the real distribution, $x \sim p_X$, and $f(x) = 0$ if it is drawn
from the synthetic or generated distribution, $x \sim p_G$. Thus, GANs can be seen as a
minimax two-player game solved through optimization:

$$\max_{g} \min_{f} V(f, g) = \mathbb{E}_{x \sim p_X}[-\log f(x)] + \mathbb{E}_{z \sim p_Z}[-\log(1 - f(g(z)))] \tag{11.39}$$


In CoGAN, as shown in Fig. 11.8, there are two GANs for two different domains.
Generative models try to decode from higher-level features to lower-level features,
as opposed to discriminative models. If $x_1$ and $x_2$ are two inputs drawn from the
marginal distributions of the first ($x_1 \sim p_{X_1}$) and second ($x_2 \sim p_{X_2}$) domains,
respectively, then the generative models of $GAN_1$ and $GAN_2$ map a random vector $z$ to
examples having the same support as $x_1$ and $x_2$. The distributions of $g_1(z)$ and $g_2(z)$
are $p_{G_1}$ and $p_{G_2}$. When $g_1$ and $g_2$ are realized as MLPs, we can write:

$$g_1(z) = g_1^{(m_1)}(g_1^{(m_1 - 1)}(\ldots g_1^{(2)}(g_1^{(1)}(z)))) \tag{11.40}$$

$$g_2(z) = g_2^{(m_2)}(g_2^{(m_2 - 1)}(\ldots g_2^{(2)}(g_2^{(1)}(z)))) \tag{11.41}$$

where $g_1^{(i)}$ and $g_2^{(i)}$ are the layers in the corresponding GANs with $m_1$ and $m_2$ layers,
respectively. The structure and the weights of the first few layers are identical, thus
giving the constraint

$$\theta_{g_1^{(i)}} = \theta_{g_2^{(i)}}, \quad \text{for } i = 0, 1, \ldots, k \tag{11.42}$$

where $k$ is the number of shared layers, and $\theta_{g_1^{(i)}}$ and $\theta_{g_2^{(i)}}$ are the parameters of
$g_1^{(i)}$ and $g_2^{(i)}$, respectively. This constraint enables the first layers that decode
high-level features to decode them in the same way for both generators $g_1$ and $g_2$.

Discriminative models map the input to a probability, estimating the likelihood
that the input is from the data distribution. If $f_1^{(i)}$ and $f_2^{(i)}$ correspond to the layers
of the discriminative networks for the two GANs with $n_1$ and $n_2$ layers, they can be
written as:

$$f_1(x_1) = f_1^{(n_1)}(f_1^{(n_1 - 1)}(\ldots f_1^{(2)}(f_1^{(1)}(x_1)))) \tag{11.43}$$

$$f_2(x_2) = f_2^{(n_2)}(f_2^{(n_2 - 1)}(\ldots f_2^{(2)}(f_2^{(1)}(x_2)))) \tag{11.44}$$

where $f_1^{(i)}$ and $f_2^{(i)}$ are the layers of $f_1$ and $f_2$, which have $n_1$ and $n_2$ layers,
respectively. The discriminative models work in contrast to the generative models
and extract low-level features in the first layers and high-level features in the last
layers. To ensure the data has the same high-level features, we share the last layers
using:

$$\theta_{f_1^{(n_1 - i)}} = \theta_{f_2^{(n_2 - i)}}, \quad \text{for } i = 0, 1, \ldots, (l - 1) \tag{11.45}$$

where $l$ is the number of shared layers and $\theta_{f_1^{(i)}}$ and $\theta_{f_2^{(i)}}$ are the parameters of
$f_1^{(i)}$ and $f_2^{(i)}$, respectively. It can be shown that learning in CoGAN corresponds to a
constrained minimax game given by:

$$\begin{aligned} & \max_{g_1, g_2} \min_{f_1, f_2} V(g_1, g_2, f_1, f_2) \\ \text{subject to } \ & \theta_{g_1^{(i)}} = \theta_{g_2^{(i)}}, \quad \text{for } i = 0, 1, \ldots, k \\ & \theta_{f_1^{(n_1 - i)}} = \theta_{f_2^{(n_2 - i)}}, \quad \text{for } i = 0, 1, \ldots, (l - 1) \end{aligned} \tag{11.46}$$

where the value function $V$ is given by:

$$\begin{aligned} V(g_1, g_2, f_1, f_2) = {} & \mathbb{E}_{x_1 \sim p_{X_1}}[-\log f_1(x_1)] + \mathbb{E}_{z \sim p_Z}[-\log(1 - f_1(g_1(z)))] \\ & + \mathbb{E}_{x_2 \sim p_{X_2}}[-\log f_2(x_2)] + \mathbb{E}_{z \sim p_Z}[-\log(1 - f_2(g_2(z)))] \end{aligned} \tag{11.47}$$

The main advantage of CoGAN is that by drawing the samples separately
from the marginal distributions, CoGAN can learn the joint distribution from
the two domains very effectively.
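
The weight-sharing constraint on the generators can be illustrated with two small MLPs whose first layers are literally the same parameter objects. This is a toy sketch, not the CoGAN implementation: the layer sizes, the tanh activations, and the helper names are all assumptions made for the example.

```python
import numpy as np


def make_layer(rng, n_in, n_out):
    """One dense layer stored as a dictionary of parameters."""
    return {"W": rng.normal(0.0, 0.1, size=(n_in, n_out)), "b": np.zeros(n_out)}


def build_coupled_generators(rng, sizes, k):
    """Two MLP generators whose first k layers share the same parameter objects."""
    shared = [make_layer(rng, sizes[i], sizes[i + 1]) for i in range(k)]
    tail_1 = [make_layer(rng, sizes[i], sizes[i + 1]) for i in range(k, len(sizes) - 1)]
    tail_2 = [make_layer(rng, sizes[i], sizes[i + 1]) for i in range(k, len(sizes) - 1)]
    return shared + tail_1, shared + tail_2


def generate(layers, z):
    """Forward pass: tanh hidden layers followed by a linear output layer."""
    h = z
    for layer in layers[:-1]:
        h = np.tanh(h @ layer["W"] + layer["b"])
    return h @ layers[-1]["W"] + layers[-1]["b"]


rng = np.random.default_rng(0)
g1, g2 = build_coupled_generators(rng, sizes=[4, 8, 8, 3], k=2)
z = rng.normal(size=(5, 4))  # the same random vectors feed both generators
out1, out2 = generate(g1, z), generate(g2, z)
print(out1.shape, out2.shape)
```

Because the shared layers are the same objects, any update applied to them through one generator is immediately visible to the other, while the unshared final layers let each generator render domain-specific outputs.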


11.1.1.10 Cycle Generative Adversarial Networks 


Cycle-consistent adversarial networks (CycleGAN) proposed by Zhu et al. have
been one of the most innovative generative adversarial networks in recent times
and have wide applicability in different domains [Zhu+17]. The concept of
cycle-consistency means that if we translate a sentence from language A to language B,
then translating it back from language B to language A should give a similar sentence.
The main idea is to learn to transfer from a source domain $X$ to a target domain $Y$
when no paired examples are available in the training data.
This is done in two steps: a) learning a mapping $G : X \to Y$ such that $G(X)$ is
indistinguishable from $Y$ under an adversarial loss; and
b) learning the inverse mapping $F : Y \to X$ and introducing a cycle-consistency loss
so that $F(G(X)) \approx X$ and $G(F(Y)) \approx Y$ (Fig. 11.9).

Fig. 11.8: Coupled generative adversarial networks

The learning where $G(x)$ tries to generate data that looks similar to $y$ while the
discriminator $D_Y$ aims at distinguishing $G(x)$ from real $y$ can be expressed as:

$$\min_{G} \max_{D_Y} L_{GAN}(G, D_Y, X, Y) \tag{11.48}$$

where

$$L_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \tag{11.49}$$

Similarly,

$$\min_{F} \max_{D_X} L_{GAN}(F, D_X, Y, X) \tag{11.50}$$

$$L_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))] \tag{11.51}$$

The cycle-consistency loss is about recovering the original data $x$ from the
translation $x \to G(x) \to F(G(x)) \approx x$ for the $x$ domain and $y$ from the translation
$y \to F(y) \to G(F(y)) \approx y$, captured as:

$$L_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\| F(G(x)) - x \|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\| G(F(y)) - y \|_1] \tag{11.52}$$

Thus, the total objective can be written as:

$$L_{total}(G, F, D_X, D_Y) = L_{GAN}(G, D_Y, X, Y) + L_{GAN}(F, D_X, Y, X) + \lambda L_{cyc}(G, F) \tag{11.53}$$

Fig. 11.9: Cycle generative adversarial networks (CycleGAN) with forward
cycle-consistency and backward cycle-consistency


CycleGAN does not need the data pairs in the domains to match. It can learn 
the underlying relationship and help transfer between the domains. 
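
The cycle-consistency term is easy to check on a toy problem. In the sketch below the two mappings are plain linear maps rather than neural networks, which is an assumption made so that an exact inverse exists; the matrix `A` and the helper names are hypothetical.

```python
import numpy as np


def cycle_consistency_loss(G, F, x_batch, y_batch):
    """Mean L1 reconstruction error of x -> G -> F and of y -> F -> G."""
    forward = np.mean(np.sum(np.abs(F(G(x_batch)) - x_batch), axis=1))
    backward = np.mean(np.sum(np.abs(G(F(y_batch)) - y_batch), axis=1))
    return float(forward + backward)


A = np.array([[2.0, 0.0], [1.0, 1.0]])  # hypothetical invertible "translation"


def G(x):
    return x @ A


def F_exact(y):
    return y @ np.linalg.inv(A)  # true inverse of G


def F_poor(y):
    return y  # ignores the translation entirely


rng = np.random.default_rng(0)
x_batch = rng.normal(size=(16, 2))
y_batch = rng.normal(size=(16, 2))

print("exact inverse:", cycle_consistency_loss(G, F_exact, x_batch, y_batch))
print("poor inverse: ", cycle_consistency_loss(G, F_poor, x_batch, y_batch))
```

The loss is near zero only when the second mapping undoes the first in both directions, which is what constrains the otherwise unpaired translation.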


11.1.1.11 Domain Separation Networks 


Domain separation networks by Bousmalis et al. have private encoders for learning
individual domains, shared encoders for learning common representations across
domains, a shared decoder for effective generalization using a reconstruction loss, and
a classifier using shared representations for robustness [Bou+16].

The source domain $D_s$ has $N_s$ labeled data $X_s = \{(x_i, y_i)\}$ and the target domain
$D_t$ has $N_t$ unlabeled data $X_t = \{x_i\}$. Let $E_p(x; \theta_p)$ be the function that maps an input
$x$ to a hidden representation $h_p$ that is private to the domain.
Let $E_c(x; \theta_c)$ be the function that maps an input $x$ to a hidden representation $h_c$ that
is common across the source and the target. Let $D(h; \theta_d)$ be the decoding function
that maps the hidden representation $h$ to a reconstruction $\hat{x}$ of the original input. The
reconstruction is given by $\hat{x} = D(E_c(x) + E_p(x))$. Let $G(h; \theta_g)$ be the classifier
function that maps the hidden representation $h$ to predictions $\hat{y}$ given by $\hat{y} = G(E_c(x))$.
Figure 11.10 captures the entire process of DSN.

The total loss can be written as:

$$L_{total}(\theta_c, \theta_p, \theta_d, \theta_g) = L_{class} + \alpha L_{recon} + \beta L_{difference} + \gamma L_{similarity} \tag{11.54}$$


where the hyperparameters $\alpha$, $\beta$, $\gamma$ control the weight of each loss term. The
classification loss is the standard negative log-likelihood given by:

$$L_{class} = -\sum_{i=0}^{N_s} y_i \cdot \log(\hat{y}_i) \tag{11.55}$$

The reconstruction loss is computed using the scale-invariant mean-squared error:

$$L_{recon} = \sum_{i=0}^{N_s} L_{si\_mse}(x_i, \hat{x}_i) \tag{11.56}$$


The difference loss, as the name suggests, is applied to both domains and is meant
to encourage the private and shared encoders to capture different aspects of the inputs.
Let $H_c^s$ and $H_c^t$ be matrices whose rows are the common (shared) hidden representations
of the source and target samples. Let $H_p^s$ and $H_p^t$ be matrices whose rows are the
private hidden representations of the source and the target samples. The difference
loss is given by:

$$L_{difference} = \left\| {H_c^s}^\top H_p^s \right\|_F^2 + \left\| {H_c^t}^\top H_p^t \right\|_F^2 \tag{11.57}$$

where $\|\cdot\|_F^2$ is the squared Frobenius norm.

The domain-adversarial similarity loss, which aims at maximizing the “confusion,”
is achieved via a gradient reversal layer and a domain classifier that predicts the
domain. If $d_i \in \{0, 1\}$ is the ground truth domain of sample $i$ and $\hat{d}_i \in [0, 1]$ is
the predicted value of the domain, then adversarial learning can be achieved by:

$$L_{similarity}^{DANN} = \sum_{i=0}^{N_s + N_t} \left\{ d_i \log \hat{d}_i + (1 - d_i) \log(1 - \hat{d}_i) \right\} \tag{11.58}$$


The maximum-mean discrepancy (MMD) loss can also be used instead of the 
DANN described above. 


Domain separation networks capture explicitly and jointly both the private and
shared components of the domain representations, making them less vulnerable to
noise that is correlated with the shared distributions.
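
The difference loss is the least familiar of the four terms, so a tiny numerical sketch may help. The matrices below are hand-picked, hypothetical hidden representations (rows are samples); the function name is made up for the example.

```python
import numpy as np


def difference_loss(hc_s, hp_s, hc_t, hp_t):
    """Sum of squared Frobenius norms of H_c^T H_p for the source and target domains."""
    def squared_frobenius(m):
        return float(np.sum(m ** 2))

    return squared_frobenius(hc_s.T @ hp_s) + squared_frobenius(hc_t.T @ hp_t)


# Shared representation of three samples with two hidden units.
h_common = np.array([[1.0, 2.0],
                     [0.0, 0.0],
                     [3.0, 1.0]])

# A private representation whose columns are orthogonal to those of h_common.
h_private_orthogonal = np.array([[0.0, 0.0],
                                 [5.0, 4.0],
                                 [0.0, 0.0]])

# A private representation that simply copies the shared one.
h_private_redundant = h_common.copy()

print(difference_loss(h_common, h_private_orthogonal, h_common, h_private_orthogonal))
print(difference_loss(h_common, h_private_redundant, h_common, h_private_orthogonal))
```

The loss is zero when the private and shared encoders produce orthogonal representations and grows as the private encoder duplicates what the shared encoder already captures.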




11.1.2 Theory 


We will describe two topics that have been studied in the last couple of years to
give a formal mapping to domain adaptation that is applicable in the deep learning
area. One is the generalization of most domain adaptation networks by Tzeng
et al. [Tze+17], and the other is optimal transport theory, which gives a
theoretical foundation to domain adaptation [RHS17].


11.1.2.1 Siamese Networks Based Domain Adaptations 


Tzeng et al. present a generalized Siamese architecture which captures most im- 
plementations in domain adaptations using deep learning as shown in Fig. 11.11 
[Tze+17]. The architecture has two streams, the source input which is labeled, and 
the target input which is unlabeled. The training is done with a combination of 
classification loss with either discrepancy-based loss or adversarial loss. The classi- 
fication loss is computed only using the labeled source data. The discrepancy loss is 
computed based on the domain shift between the source and the target. The adver- 
sarial loss tries to capture latent features using the adversarial objective with respect 
to the domain discriminator. This study helps to view all the architectures seen so far
as various extensions of the general architecture, with changes to how the classification
loss, discrepancy loss, and adversarial loss are computed.


Fig. 11.10: Domain separation networks (DSN)


The setup can be generalized to drawing source-labeled samples $(X_s, Y_s)$ from a
distribution $p_s(x, y)$ and unlabeled target samples $X_t$ from a distribution $p_t(x, y)$. The
goal is to learn from the source examples a representation mapping $M_s$ and a classifier
$C_s$, and also to have a target mapping $M_t$ with a classifier $C_t$ that, at prediction time,
learns to classify unseen examples into $K$ categories.

The goal of most adversarial methods is to minimize the distance between the
distributions of $M_s(X_s)$ and $M_t(X_t)$, which implicitly means that in most cases the
source and target classifiers can be the same, $C = C_s = C_t$. Source classification can be
given in a generic loss optimization form as:

$$\min_{M_s, C} L_{class}(X_s, Y_s) = -\mathbb{E}_{(x_s, y_s) \sim (X_s, Y_s)} \sum_{k=1}^{K} \mathbb{1}[k = y_s] \log C(M_s(x_s)) \tag{11.59}$$

A domain discriminator $D$, which classifies whether the data is drawn from the
source or the target, can be optimized with the loss:

$$\begin{aligned} \min_{D} L_{adv_D}(X_s, X_t, M_s, M_t) = {} & -\mathbb{E}_{x_s \sim X_s}[\log D(M_s(x_s))] \\ & -\mathbb{E}_{x_t \sim X_t}[\log(1 - D(M_t(x_t)))] \end{aligned} \tag{11.60}$$

Fig. 11.11: Siamese networks for generalizing the domain adaptation implementation


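With the source and target mapping constraints given by $\psi(M_s, M_t)$, training against
a discriminator $D$ that can distinguish between them can be captured as an adversarial
objective $L_{adv_M}$:

$$\begin{aligned} & \min_{D} L_{adv_D}(X_s, X_t, M_s, M_t) \\ & \min_{M_s, M_t} L_{adv_M}(X_s, X_t, D) \quad \text{s.t. } \psi(M_s, M_t) \end{aligned} \tag{11.61}$$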


Various techniques described in domain adaptation can now be understood with this
general framework.

The gradient reversal process can be written in terms of optimizing the
discriminator loss directly as $L_{adv_M} = -L_{adv_D}$.

When using GANs there are two losses: the discriminator loss and the generator
loss. The discriminator loss $L_{adv_D}$ remains the same, while the generator loss can be
written as:

$$\min_{M_s, M_t} L_{adv_M}(X_s, X_t, D) = -\mathbb{E}_{x_t \sim X_t}[\log D(M_t(x_t))] \tag{11.62}$$

The domain confusion loss can be written as minimizing the cross-entropy loss
given by:

$$\min_{M_s, M_t} L_{adv_M}(X_s, X_t, D) = -\sum_{d \in \{s, t\}} \mathbb{E}_{x_d \sim X_d} \left[ \frac{1}{2} \log D(M_d(x_d)) + \frac{1}{2} \log(1 - D(M_d(x_d))) \right] \tag{11.63}$$


11.1.2.2 Optimal Transport 


In the last few years, optimal transport theory has come into prominence from
various statistical, optimization, and machine learning perspectives. Optimal transport
can be seen as a way of measuring the cost of transporting data between two different
distributions, based on the geometry of the data points in the two distributions and a
cost function related to transportation [Mon81]. This transport mechanism maps
very well to domain adaptation, where the source and target domains can be seen
as two different distributions, and optimal transport explains the mapping from both a
theoretical and an optimization perspective. The Wasserstein distance in optimal
transport, which is used to measure the distance between two distributions, can also be
used as a minimization objective or regularization function in the overall loss function.
Optimal transport has been used to give a good generalization bound to deep domain
adaptation frameworks [RHS17].
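
For intuition, the Wasserstein distance has a particularly simple form in one dimension: for two samples of equal size with uniform weights, the optimal transport plan pairs the sorted values. The sketch below covers only this special case and is not how the distance would be computed for high-dimensional network features; the function name and data are hypothetical.

```python
import numpy as np


def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    a, b = np.sort(np.asarray(a, dtype=float)), np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes samples of equal size")
    # In 1-D the optimal plan matches the i-th smallest source point
    # with the i-th smallest target point.
    return float(np.mean(np.abs(a - b)))


rng = np.random.default_rng(0)
source = rng.normal(size=500)
shifted_target = source + 3.0  # hypothetical target: same shape, shifted by 3

print("shifted copy:", wasserstein_1d(source, shifted_target))
print("reordered copy:", wasserstein_1d(source, source[::-1]))
```

A pure shift of the distribution by 3 units costs 3 units of transport per point, while reordering the same points costs nothing, which is the geometric sensitivity that makes the distance attractive as a domain adaptation objective.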


11.1.3 Applications in NLP 


Glorot et al. in the very early days of deep learning showed how stacked autoencoders
with sparse rectifier units could learn feature-level representations that could
perform domain adaptation on sentiment analysis very effectively [GBB11c].

Nguyen and Grishman employ word embeddings along with word clustering
features to show that domain adaptation in relation extraction can be very
effective [NG14]. Nguyen et al. further explore the use of word embeddings and tree
kernels to generate a semantic representation for relation extraction and show
improvements over feature-based methods [NPG15]. Nguyen and Grishman show how basic
CNNs with word embeddings, position embeddings, and entity type embeddings as
input can learn an effective representation that gives a good domain adaptation method
for event detection [NG15a]. Fu et al. show the effectiveness of domain adaptation
for relation extraction using domain-adversarial neural networks (DANN) [Fu+17].
They use word embeddings, position embeddings, entity type embeddings, chunking,
and dependency path embeddings. They use CNNs and DANN with a gradient
reversal layer to effectively learn relation extraction with cross-domain features.

Zhou et al. use novel bi-transferring deep neural networks to transfer source
examples into the target and vice versa, achieving close to state-of-the-art
results in sentiment classification [Zho+16]. Zhang et al. use the mapping between the
keywords of the source and target and employ it in adversarial training for domain
adaptation in classification [ZBJ17]. Ziser and Reichart show how pivot features
(common features that are present in source and target) along with autoencoders can
learn a representation that is very effective in domain adaptation for sentiment
classification [ZR17]. Ziser and Reichart further extend the research to a pivot-based
language model in a structure-aware manner that can be employed for various
classification and sequence-to-sequence-based tasks for improved results [ZR18]. Yu
and Jiang combine the ideas of structural correspondence learning, pivot-based
features, and joint-task learning for effective domain adaptation in sentiment
classification [YJ16].


11.1.4 Applications in Speech Recognition 


Falavigna et al. show how deep neural networks and automatic quality estimation
(QE) can be used for domain adaptation [Fal+17]. They use a two-step process in
which manually labeled transcripts are first used to evaluate the word error rate (WER)
on data of different quality. Adaptation is then made on unseen data according to the
WER scores predicted by the QE component, showing significant improvements in
performance.

Hosseini-Asl et al. extend the CycleGAN concepts to have multiple discrim- 
inators (MD-CycleGAN) for unsupervised non-parallel speech domain adapta- 
tion [Hos+18]. They use multiple discriminator-enabled CycleGAN to learn fre- 
quency variations in spectrograms between the domains. They use different gender 
speech ASR in training and testing to evaluate the domain adaptation aspect of the 
framework and report a good performance by using the MD-CycleGAN architecture 
on unseen domains. 

Adapting to different speakers with different accents is one of the open research
problems in speech recognition. Wang et al. in their work do a detailed analysis
treating this as a domain adaptation problem with different frameworks to give
important insights [Wan+18a]. They use three different speaker adaptation methods:
linear transformation (LIN), learning hidden unit contribution (LHUC), and
Kullback-Leibler divergence (KLD) on an i-vector based DNN acoustic model. They
show that, depending on the accent, using one of these methods can significantly
improve ASR performance not only for medium-to-heavy accents but also for
slight-accent speakers. Sun et al. use domain-adversarial training for solving accented
speech in ASR [Sun+18a]. They add a domain-adversarial objective that uses unlabeled
target-domain data with different accents to discriminate the source from the target,
while the labeled source-domain data is used for classification, and show a significant
drop in error rates for unseen accents.

Improving ASR quality in the presence of noise by improving the robustness
of the models can also be approached from the domain adaptation view, based
on how the noise in the target domain or unseen data is different from the source
domains. Serdyuk et al. use GANs for domain adaptation in unseen noisy target
datasets [Ser+16]. The model has an encoder, a decoder, and a recognizer with a
hidden representation in between that is used to perform the dual tasks of improving
the recognition and minimizing domain discrimination. They show their method to
be better at generalization when the target domain has more noise categories than
the ones used in the source training data. Sun et al. use adversarial data augmentation
with the fast gradient sign method (FGSM) to show significant improvements in
the robustness of acoustic models [Sun+18b]. Meng et al. use domain separation
networks (DSN) for domain adaptation between source and targets for robustness
on target data with different noise levels [Men+17]. The shared components learn
the domain invariance between the source and the target domains. The private
components are orthogonal to the shared ones and learn to increase domain
invariance. With their approach, they show a significant decrease in WER over a
baseline with an unadapted acoustic model.


11.2 Zero-Shot, One-Shot, and Few-Shot Learning 


The extreme cases of the domain adaptation or transfer learning problem arise when
there are very few training examples that match the test example. The best example is
the facial recognition problem from computer vision, where there is exactly 1 training
example for each person, and when someone appears, the need is to match them to an
existing person or classify them as new and unseen. Based on the number of training
examples corresponding to the unseen example we get at prediction time, there are
different flavors such as zero-shot learning, one-shot learning, and few-shot learning.
In the next sections, we will discuss each of them and the techniques that have been
popular for addressing them.


11.2.1 Zero-Shot Learning 


Zero-shot learning is a form of transfer learning where we have absolutely no
training data available for the classes we will see in the test set or when the model is
used for predictions. The idea is to learn a mapping from classes to a vector space in
such a way that an unseen class in the future can be mapped to the same space, and
“closeness” to the existing classes can be used to provide some information about the
unseen class. An example from the NLU domain: when data is available about
computers and knowledge bases (KB) exist for retrieving information about them, a
question such as “what is the cost of a specific part for a function such as the display”
can be formed as a query to a KB having a database of components, subcomponents,
functions, and parts. The learned mapping can then be transferred to another
completely different domain. For example, it can be used in car manufacturing
on similar queries if questions about the cost of parts for performing specific functions
are common there.


11.2.1.1 Techniques 


We will illustrate a general method and variations that have been successful in
computer vision and language/speech understanding/recognition tasks [XSA17].

The approach is to measure the similarity between source and target domains. In
computer vision, for example, one way is to map the label space to a vector space
based on side information, such as attributes describing the picture. The attributes
can be meta-level or image-level features, such as “presence of a specific color,”
“size of the object,” and others. The vector representation can be a one-hot vector
of these attributes. The source data features are embedded in the source feature
space. The next step is to find the compatibility between the source feature space
and the label space, as shown in Fig. 11.12, using a compatibility function.


[Figure panels: input space X, feature space, label space, labels Y; compatibility
function $F(x, y; W) = \theta(x)^{\top} W \varphi(y)$]

Fig. 11.12: Zero-shot learning


Formally, the source dataset is $S = \{(x_n, y_n), n = 1, \ldots, N\}$ with inputs
and labels $x_n \in \mathcal{X}$, $y_n \in \mathcal{Y}$, respectively. The goal is to
learn a function $f(x)$ that minimizes the loss in predicting the label $y$ and can
be written using the minimization of the empirical risk in the form:

$$\frac{1}{N} \sum_{n=1}^{N} L(y_n, f(x_n)) \tag{11.64}$$


where $L$ is a loss measuring function. For classification, it can be 0 when matching
and 1 when not matching. Let $\theta$ be the source embedding function that
transforms the input data to its feature space, i.e.,
$\theta : \mathcal{X} \to \tilde{\mathcal{X}}$. Similarly, let
$\varphi : \mathcal{Y} \to \tilde{\mathcal{Y}}$ be the label embedding function that
transforms the labels into a space using the attributes. The compatibility function
$F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and the function $f$ are defined
in terms of the model parameters $W$ of $F$, i.e., how compatible the pair $(x, y)$
is given the parameters $W$:

$$f(x; W) = \arg\max_{y \in \mathcal{Y}} F(x, y; W) \tag{11.65}$$


Different forms of compatibility functions exist and are mentioned below: 


1. Pairwise Ranking: A popular method which uses a convex objective, pairwise
   ranking, and SGD updates is given by:

   $$\sum_{y \in \mathcal{Y}^{train}} \left[ \Delta(y_n, y) + F(x_n, y; W) - F(x_n, y_n; W) \right]_+ \tag{11.66}$$

   where $\Delta$ is the 0/1 loss, and $F$ is the linear compatibility function.
2. Weighted Pairwise Ranking: An extension to the above which adds weights as
   in:

   $$\sum_{y \in \mathcal{Y}^{train}} l_k \left[ \Delta(y_n, y) + F(x_n, y; W) - F(x_n, y_n; W) \right]_+ \tag{11.67}$$

   where $l_k = \sum_{i=1}^{k} \alpha_i$, $\alpha_i = 1/i$, and $k$ is the number of ranks.

3. Structured Joint Embedding (SJE): Another pairwise ranking method, but for the
   multiclass scenario, where one uses the max function to find the most violating
   class, is given by:

   $$\max_{y \in \mathcal{Y}^{train}} \left[ \Delta(y_n, y) + F(x_n, y; W) - F(x_n, y_n; W) \right]_+ \tag{11.68}$$

4. Embarrassingly Simple Zero-Shot Learning: Extension to the above SJE method
   where a regularization term is added:

   $$\gamma \|W \varphi(y)\|^2 + \lambda \|\theta(x)^{\top} W\|^2 + \beta \|W\|^2 \tag{11.69}$$

   where $\gamma$, $\lambda$, $\beta$ are the regularization parameters.
5. Semantic Autoencoder: Another technique that uses a linear autoencoder to project
   from the $\theta(x)$ space to the $\varphi(y)$ space:

   $$\min_{W} \|\theta(x) - W^{\top} \varphi(y)\|^2 + \lambda \|W \theta(x) - \varphi(y)\|^2 \tag{11.70}$$


6. Latent Embeddings: To overcome the limitations of the linear weights $W$, a
   piecewise-linear modification is made to the compatibility function to achieve
   non-linearity as given by:

   $$F(x, y; W_i) = \theta(x)^{\top} W_i \varphi(y) \tag{11.71}$$

   where $W_i$ are the different linear weights learned.

7. Cross Model Transfer: Another non-linear technique performs a non-linear
   transformation using a two-layer neural network with weights $W_1$ and $W_2$ and
   the objective function:

   $$\sum_{y \in \mathcal{Y}^{train}} \sum_{x \in \mathcal{X}_y} \|\varphi(y) - W_1 \tanh(W_2 \theta(x))\|^2 \tag{11.72}$$


8. Direct Attribute Prediction: Another technique uses the attributes associated
   with the class to learn directly, given by:

   $$f(x) = \arg\max_{c} \prod_{m=1}^{M} \frac{p(a_m^c \mid x)}{p(a_m^c)} \tag{11.73}$$

   where $M$ is the total number of attributes, $a_m^c$ is the $m$th attribute of
   class $c$, and $p(a_m^c \mid x)$ is the attribute probability associated with the
   given data $x$.
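
To make the bilinear compatibility function and the prediction rule of Eq. 11.65 concrete, the following is a minimal NumPy sketch, not an implementation from [XSA17]. The weight matrix, the attribute vectors, and the class names are hypothetical toy values, and the input features are assumed to be already embedded.

```python
import numpy as np

def compatibility(theta_x, W, phi_y):
    """Bilinear compatibility F(x, y; W) = theta(x)^T W phi(y)."""
    return float(theta_x @ W @ phi_y)

def predict(theta_x, W, class_embeddings):
    """Eq. 11.65: return the class whose embedding is most compatible with x."""
    scores = {c: compatibility(theta_x, W, phi) for c, phi in class_embeddings.items()}
    return max(scores, key=scores.get)

def pairwise_ranking_loss(theta_x, W, train_embeddings, true_class):
    """Eq. 11.66: sum of hinge terms over the training classes."""
    f_true = compatibility(theta_x, W, train_embeddings[true_class])
    loss = 0.0
    for c, phi in train_embeddings.items():
        delta = 0.0 if c == true_class else 1.0  # 0/1 loss
        loss += max(0.0, delta + compatibility(theta_x, W, phi) - f_true)
    return loss

# Hypothetical toy setup: 3-dimensional input features, 2 attributes per class.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
train_embeddings = {"horse": np.array([1.0, 0.0]), "tiger": np.array([0.0, 1.0])}
# "zebra" has no training examples and is described only by its attributes.
all_embeddings = dict(train_embeddings, zebra=np.array([1.0, 1.0]))

prediction = predict(np.array([0.9, 0.8, 0.1]), W, all_embeddings)
loss_correct = pairwise_ranking_loss(np.array([1.0, 0.0, 0.0]), W, train_embeddings, "horse")
loss_wrong = pairwise_ranking_loss(np.array([0.0, 1.0, 0.0]), W, train_embeddings, "horse")
```

The unseen class participates only at prediction time, through its attribute vector; the ranking loss is computed over the training classes alone.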


11.2.2 One-Shot Learning 


The general problem in one-shot learning is to learn from a dataset where there is
only one example per class. As before, a similarity function is learned over
representations of the training examples, so that at prediction time the similarity
function is used to find the closest available example in the training data.


11.2.2.1 Techniques 


Siamese network-based architectures, with variations, are the most common way to
learn similarities in these frameworks. The network parameters are learned through
pairwise learning from the training dataset, as shown in Fig. 11.13. One variation is
that, instead of the fully connected layers feeding a softmax layer, the features or
encoding of the input are used for similarity; the resulting models are called
matching networks. One way to learn the parameters of the network is, at training
time, to minimize the distance when the inputs are similar and maximize it when they
are dissimilar, and, at prediction time, to use the learned representation to compute
similarity with the existing training samples. If $x_i$ and $x_j$ are two examples
from the training data, the similarity function can be the squared distance between
the two outputs of the Siamese networks, given by:

$$d(x_i, x_j) = \|f(x_i) - f(x_j)\|_2^2 \tag{11.74}$$


Another way to learn the parameters is through the triplet loss function by
Schroff et al. [FSP15]. The idea is to pick an anchor example $x_A$ for which a
positive sample $x_P$ and a negative sample $x_N$ are used to learn the parameters of
the network, so that the distance between the anchor and the positive is minimized,
and the distance between the anchor and the negative is maximized:

$$\mathcal{L}(x_A, x_P, x_N) = \max\left(\|f(x_A) - f(x_P)\|_2^2 - \|f(x_A) - f(x_N)\|_2^2 + \alpha,\ 0\right) \tag{11.75}$$

Fig. 11.13: One-shot learning

In Eq. 11.75, the parameter $\alpha$ is similar to the margin in SVMs. The training
data is used to generate the triplets, and a stochastic gradient method can be used
for learning the parameters with this loss function.
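
The distance of Eq. 11.74 and the triplet loss of Eq. 11.75 can be sketched in a few lines of NumPy. This is only an illustration: the vectors stand in for the outputs $f(x)$ of an already trained network, and the names `alice` and `bob` and the margin value are hypothetical.

```python
import numpy as np

def squared_distance(u, v):
    """Eq. 11.74: squared Euclidean distance between two embeddings."""
    return float(np.sum((u - v) ** 2))

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Eq. 11.75: hinge on (anchor-positive) minus (anchor-negative) distance."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)

def nearest_example(f_query, support):
    """One-shot prediction: label of the closest stored embedding."""
    return min(support, key=lambda name: squared_distance(f_query, support[name]))

anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])
negative = np.array([3.0, 0.0])

# Positive already closer than negative by more than the margin: no loss.
easy = triplet_loss(anchor, positive, negative, alpha=0.5)
# Roles swapped: the "positive" is far away, so the loss is large.
hard = triplet_loss(anchor, negative, positive, alpha=0.5)

# One stored embedding per person, as in the one-shot setting.
support = {"alice": np.array([0.0, 1.0]), "bob": np.array([3.0, 0.0])}
match = nearest_example(np.array([0.2, 0.9]), support)
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so only violating triplets contribute gradients during training.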


11.2.3 Few-Shot Learning 


Few-shot learning is a relatively easier form of learning as compared to the previous 
two. In general, most of the techniques mentioned in one-shot learning can be used 
for few-shot learning too, but we will illustrate a few additional techniques that have 
been successful. 


11.2.3.1 Techniques 


The deep learning techniques for few-shot learning can be described as either
data-based approaches or model-based approaches. In the data-based approach, the
training data is augmented in some form to increase the number of similar samples.

In contrast, the model- or parameter-based approach enforces regularization in some
form to prevent the model from overfitting to the limited training samples. Yoo et
al. use the interesting idea of correlating activations from the input data to form
“groups” of similar neurons or parameters per layer in the source data [Yoo+18]. The
hyperparameter “number of groups” per layer is chosen using the k-means clustering
algorithm, and k is further learned using reinforcement learning techniques. Once
trained on the source dataset, these groups of neurons are fine-tuned on the target
domain using group-wise backpropagation. As the number of parameters increases, with
little training data for each category, optimization algorithms such as SGD are not
effective. Ren et al. propose a meta-learning approach for solving this in two steps:
(a) a teacher model learns from a large amount of data to capture the parameter
space, and (b) it then guides the actual pupil, or classifier, to learn using the
parameter manifold, giving excellent results [Ren+18].


11.2.4 Theory 


Palatucci et al. present a semantic output code mapping classifier as a theoreti- 
cal base and formalization for zero-shot learning [Pal+09]. The classifier mapping 
helps to understand how the knowledge base and the semantic features of outputs get 
mapped, and how the learning can happen even when the novel classes are missing 
from the training data using the PAC framework. 

Fei-Fei et al. propose a Bayesian framework that gives a theoretical basis to
one-shot learning in the object identification domain [FFFP06]. By modeling prior
knowledge of the data as a probability density function on the model parameters, with
the posteriors being the object categories, the Bayesian framework shows how models
carry enough information to correctly identify the categories even with very few
training examples.

Triantafillou et al. propose an information retrieval framework and implementa- 
tion for modeling few-shot learning [TZU17]. This paper proposes learning a sim- 
ilarity metric for mapping objects into a space, where they are grouped based on 
their similarity relationship. The training objective optimizes relative orderings of 
the data points in each training batch to leverage importance in the low data regime. 


11.2.5 Applications in NLP and Speech Recognition 


Most of the applications of zero-shot, one-shot, and few-shot learning have been
in computer vision. Only recently have there been applications in NLP and speech.
Pushp and Srivastava employ zero-shot learning for text categorization [PS17]. The
source dataset is news headlines crawled from the web, with the categories taken from
the search engine. The target test data is the UCI news dataset and the tweets
categorization dataset. They employ different neural architectures based on how and
what one feeds to the LSTM networks. The trained model is then applied to datasets
whose relationships it has not seen before (UCI news and tweets) and gives very
impressive results, showing the effectiveness of zero-shot learning methods. Levy et
al. employ zero-shot learning in relation extraction by learning to answer questions
from a corpus [LS17]. The relationship is learned by posing questions and having
sentences in the answers that map to an entity, where the relationship is mentioned,
from slot-filling datasets such as WikiReading. They show that even on unseen
relationships, zero-shot learning shows enough promise as a methodology. Yogatama et
al. explore RNNs as generative models and empirically show the promise of generative
learning in a zero-shot learning setting [Yog+17]. Mitchell et al. employ zero-shot
learning using explanations about the labels or categories to learn the embedding
space using constraints and show good results on email categorization [MSL18].

Dagan et al. propose a zero-shot learning framework for the event extrac- 
tion problem using event ontologies and small, manually annotated labeled 
datasets [Dag+18]. They show transferability to even unseen types and additionally 
report results close to the state of the art. 

Yan et al. address the difficult short text classification problem by using few-shot 
learning [YZC18]. They use the Siamese CNNs to learn the encoding that distin- 
guishes complex or informal sentences. Different structures and topics are learned 
using the few-shot learning method and shown to generalize and have better accu- 
racies than many traditional and deep learning methods. 

Ma et al. have proposed a neural architecture for both few-shot and zero-shot
learning on fine-grained named entity typing, i.e., detecting not only the entity in
the sentence but also its type [MCG16]. For example, “John is talking using his
phone” not only identifies “John” as the entity but can also decode that “John” is
the speaker. They use prototypical and hierarchical information to learn label
embeddings and give a huge performance boost for the classification. Yazdani and
Henderson use zero-shot learning for spoken language understanding, where they assign
label actions with attributes and values from the utterance output of ASR dialogs
[YH15]. They build a semantic space between the words and the labels, so that it can
form a representation layer that predicts unseen words and labels in a very effective
way.

Rojas-Barahona et al. have shown the success of deep learning and zero-shot 
learning in semantic decoding of spoken dialog systems. They use deep learning 
for learning features jointly from known and unknown categories [Roj+18]. They 
then use unsupervised learning to tune the weights, further using risk minimization 
to achieve zero-shot learning when tested on unseen data with slot pairs not known 
in the training set. Keren et al. use one-shot learning with Siamese networks to 
compute the similarity between the single exemplars from the source data to the 
unseen examples from the target data in the spoken term detection problem in the 
audio domain [Ker+18]. 


11.3 Case Study 


We will go through a detailed case study to explore the topics discussed in the
chapter from a practical point of view. We chose the Amazon product review dataset
published by Blitzer et al. [BDP07] for the sentiment classification task. The
dataset has reviews for various product domains such as books, DVDs, kitchen, and
electronics. Each domain has 2000 labeled examples with binary labels (positive and
negative) based on the reviews. The kitchen and electronics domains have a large
number of unlabeled examples as well. In our experiments we have not used the
unlabeled examples but have treated many labeled examples as unlabeled when required.

We chose two different cases: (1) the source domain is kitchen and the target 
domain is electronics and (2) the source domain is books and the target domain is 
kitchen for our experiments. We divided all the datasets into training and testing 
with 1600 and 400 examples, respectively. The validation data is chosen from the 
training dataset either as a percent or a stratified sample. Though the goal is not to 
replicate papers or fine-tune each method to get best results, we have done some 
parameter tuning and kept most parameters standard or constant to see a relative 
impact. 
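
The 1600/400 split described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook code: the helper name `stratified_split`, the seed, and the assumption of 1000 positive and 1000 negative reviews per domain are ours.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Split example indices so each label keeps the same share in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Assumed label layout: 1000 positive (1) and 1000 negative (0) reviews.
labels = [1] * 1000 + [0] * 1000
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

The same helper can be applied again to the training indices to carve out a stratified validation set.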


11.3.1 Software Tools and Libraries 


We will describe the main open source tools and libraries we have used below for 
our case study. There are some open source packages for specific algorithms which 
we have either used, adapted, or extended that are mentioned in the notebook itself: 


Keras (www.keras.io) 

TensorFlow (https://www.tensorflow.org/) 
Pandas (https://pandas.pydata.org/) 
scikit-learn (http://scikit-learn.org/) 
Matplotlib (https://matplotlib.org/) 


11.3.2 Exploratory Data Analysis 


Similar to the other case studies, we perform some basic EDA to understand the data
and some of its characteristics. The plots shown in Fig. 11.14a, b are word
distribution bar charts across the entire corpus of source and target for the
sentiments. They clearly show that going from kitchen to electronics reviews is a
smaller change than going from books to kitchen reviews.

The plots shown in Fig. 11.15a-c illustrate the word clouds for the positive
sentiment data across books, kitchen, and electronics reviews. Just visually
exploring some high-frequency words, the similarity between the kitchen and
electronics word clouds, as well as the differences between the books and kitchen
word clouds, is very evident. The plots shown in Fig. 11.16a-c, depicting the word
clouds for the negative sentiment data across books, kitchen, and electronics
reviews, illustrate the same characteristics.


11.3.3 Domain Adaptation Experiments


We next describe in detail all the experiments we carried out with transfer learning
techniques in the form of training process, model, algorithms, and changes. Again,
the goal was not to get the best tuned models for each but to understand practically
how each technique, with its biases and processes, performs on some of these complex
real-world tasks. We carry out the experiments for both books-kitchen and
kitchen-electronics as our source-target domains. We use classification accuracy as
the metric to measure performance, as the test data has an equal number of positive
and negative sentiments.

Fig. 11.14: Word distribution comparisons in quartiles of 25, 50, and 75%. (a) Books
and kitchen comparison. (b) Kitchen and electronics comparison

Fig. 11.15: Word cloud for positive sentiments from (a) books, (b) kitchen, and (c)
electronics data, respectively

Fig. 11.16: Word cloud for the negative sentiments from (a) books, (b) kitchen, and
(c) electronics data, respectively


11.3.3.1 Preprocessing 


We perform some basic preprocessing on the raw data for carrying out the sentiment
classification tasks. The data is parsed from XML-based documents and tokenized into
words, basic stop words are removed, and the sequences are padded to give a constant
maximum length representation for each. We create a vocabulary by collecting all the
words from the source and the target, capped at a maximum size of 15000 for most
experiments. For some of the vector-space models we use a bag-of-words
representation with n-grams of size 2 and a maximum of 10000 features.
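
The tokenize, cap, and pad steps can be sketched as follows. This is a simplified standard-library illustration rather than the notebook code: the stop-word list, the reserved ids for padding and unknown words, and the sample sentences are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "it"}  # hypothetical tiny stop list
PAD, UNK = 0, 1                               # reserved ids (an assumption)

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent words across source and target."""
    counts = Counter(w for tokens in token_lists for w in tokens)
    most_common = [w for w, _ in counts.most_common(max_size)]
    return {w: i + 2 for i, w in enumerate(most_common)}

def encode(tokens, vocab, max_len):
    """Map words to ids, truncate to max_len, and pad with PAD."""
    ids = [vocab.get(w, UNK) for w in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))

source = ["The blender is great and sturdy"]
target = ["The camera is great"]
token_lists = [tokenize(t) for t in source + target]
vocab = build_vocab(token_lists, max_size=15000)
encoded = [encode(t, vocab, max_len=5) for t in token_lists]
```

Building the vocabulary from both domains, as described above, means that target-only words such as "camera" still receive their own id even when the model is trained on the source.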


11.3.3.2 Experiments 


We will use Kim’s CNN model shown in Fig. 11.17 as our classifier model in most
experiments. We used the standard GloVe embeddings with 100 dimensions, which were
trained on 6 billion words. We list the name of each experiment and the purpose
behind it below:


1. Train Source + Test Target: The goal is to understand the transfer learning
   loss that happens when you train on the source data and test only on the target
   data due to domain change. This, as we discussed, can happen incrementally over
   time or due to a completely different environment where the model is deployed.
   This gives a basic worst-case analysis for our experiments.

2. Train Target + Test Target: This experiment gives us the best case analysis 
for the model which has not seen the source data but completely trained on the 
target training data and predicted on the target test data. 

3. Pre-trained Embeddings Source + Train Target: We try to understand the 
impact of unsupervised pre-trained embeddings on the learning process. The 
embedding layer is frozen and non-trainable in this experiment. We train the 
model on the target domain using unsupervised embeddings and test on the 
target test set. 

4. Pre-trained Embeddings Source + Train Target + Fine-tune Target: We 
train the model on the target domain using unsupervised embeddings but fine- 
tune the embeddings layer with the target train data. 

5. Pre-trained Embeddings + Train Source + Fine-tune Source and Target:
   This can be the best case of pre-training and fine-tuning, where you get the
   advantage of learning embeddings from unsupervised data, train on the source,
   fine-tune on the target, and thus have more examples for learning useful
   representations across domains.

Fig. 11.17: Kim’s CNN model


6. Stacked Autoencoders and DNNs: We use DNNs with stacked autoencoders for
   unsupervised latent feature representation learning. We train the model on the
   source domain in an unsupervised way, fine-tune the model with new classification
   layers on the target training data, and test on the target data.

7. Stacked Autoencoders and CNNs: The goal is to understand the impact of the
   latent representation learned from unseen source data on the target domain using
   CNNs for autoencoders, as shown in Fig. 11.18a, b.

A sample code that shows how the autoencoder is constructed:

    from keras.layers import Input, Dense
    from keras.models import Model

    input_layer = Input(shape=(300, 300))
    # encoding layers to form the bottleneck
    encoded_h1 = Dense(128, activation='tanh')(input_layer)
    encoded_h2 = Dense(64, activation='tanh')(encoded_h1)
    encoded_h3 = Dense(32, activation='tanh')(encoded_h2)
    encoded_h4 = Dense(16, activation='tanh')(encoded_h3)
    encoded_h5 = Dense(8, activation='tanh')(encoded_h4)
    # latent or coding layer
    latent = Dense(2, activation='tanh')(encoded_h5)
    # decoding layers
    decoder_h1 = Dense(8, activation='tanh')(latent)
    decoder_h2 = Dense(16, activation='tanh')(decoder_h1)
    decoder_h3 = Dense(32, activation='tanh')(decoder_h2)
    decoder_h4 = Dense(64, activation='tanh')(decoder_h3)
    decoder_h5 = Dense(128, activation='tanh')(decoder_h4)
    # output layer
    output_layer = Dense(300, activation='tanh')(decoder_h5)
    # autoencoder using deep neural networks
    autoencoder = Model(input_layer, output_layer)
    autoencoder.summary()
    autoencoder.compile('adadelta', 'mse')

Fig. 11.18: Unsupervised training from source using (a) DNN and (b) CNN
autoencoders and further trained/tested on target with classification layer in Keras


Using the autoencoder with encoding layers for classification:

    from keras.models import Sequential
    from keras.layers import Dense, Flatten

    # create a sequential model
    classification_model = Sequential()
    # add all the encoding layers from autoencoder
    classification_model.add(autoencoder.layers[0])
    classification_model.add(autoencoder.layers[1])
    classification_model.add(autoencoder.layers[2])
    classification_model.add(autoencoder.layers[3])
    classification_model.add(autoencoder.layers[4])
    classification_model.add(autoencoder.layers[5])
    # flatten the output
    classification_model.add(Flatten())
    # classification layer
    classification_model.add(Dense(2, activation='softmax'))
    classification_model.compile(optimizer='rmsprop',
                                 loss='categorical_crossentropy',
                                 metrics=['accuracy'])


8. Marginalized Stacked Autoencoders: The goal of this experiment is to understand
   the impact of the mSDA architecture in domain adaptation [Che+12]. We first learn
   the joint representation using the source and target data. Next, we use the last
   layer of mSDA as the feature layer, concatenated with the input layers, to train
   an SVM on labeled source data and predict on unlabeled target test data.

9. Second-Order Statistical-based Method (Deep CORAL, CMD, and MMD): The goal is to
   see whether, when the target data is unlabeled, a second-order statistical-based
   method that learns from source and target can be useful in predicting the target
   under the domain shift.

10. Domain-Adversarial Neural Network (DANN): The goal is to see whether, when the
    target data is unlabeled, an adversarial-based method that learns from source
    and target can be useful in predicting the target under the domain shift
    (Table 11.1).
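
As an illustration of what a second-order statistical method compares, the CORAL loss measures the squared Frobenius distance between the feature covariances of the two domains, scaled by $1/(4d^2)$ for $d$-dimensional features. The NumPy sketch below computes that quantity on hypothetical toy feature matrices; it is not the code used in the experiments, where the loss is added to the classification loss of a network.

```python
import numpy as np

def coral_loss(source_features, target_features):
    """Squared Frobenius distance between feature covariances, scaled by 1/(4 d^2)."""
    d = source_features.shape[1]
    cov_s = np.cov(source_features, rowvar=False)  # columns are feature dimensions
    cov_t = np.cov(target_features, rowvar=False)
    return float(np.sum((cov_s - cov_t) ** 2) / (4.0 * d * d))

# Hypothetical 2-dimensional features for four examples per domain.
source = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
shifted = source + 5.0   # same spread, different mean
scaled = 2.0 * source    # same mean direction, larger spread

same_loss = coral_loss(source, source)
shift_loss = coral_loss(source, shifted)
scale_loss = coral_loss(source, scaled)
```

Because only covariances are compared, a pure shift in the feature means goes unpenalized, while a change in spread does not.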


11.3.3.3 Results and Analysis 


Table 11.1: Domain adaptation experiments on two different datasets for analyzing
impact of source-target domain shift

Experiment                                                            Source (books) and   Source (kitchen) and
                                                                      target (kitchen)     target (electronics)
                                                                      test accuracy        test accuracy
Train source + test target                                            69.0                 78.00
Train target + test target                                            82.5
Pre-trained embeddings source + train target                          80.25
Pre-trained embeddings source + train target + fine-tune target       94 5
Pre-trained embeddings + train source + fine-tune source and target   86.75
Stacked autoencoders and DNNs                                         63.75
Stacked autoencoders and CNNs                                         79.25
Marginalized stacked autoencoders                                     69.75
CORAL                                                                 69.25
CMD                                                                   69.25
MMD                                                                   69.25
DANN                                                                  80.0

Some of the observations and analysis of the results are given below:


1. Books to kitchen has a higher domain transfer loss than kitchen to
   electronics: with train accuracy (78.25) and test accuracy (69.00) it is
   9.25, compared with 5.75 for kitchen to electronics, with train accuracy
   (83.75) and test accuracy (78.00). The word clouds and the data
   distribution confirm that reviews written for books are very different
   from those for kitchen and electronics.

2. Using the pre-trained embeddings has an impact, and the incremental
   improvement seen going from just the frozen embeddings to the embeddings
   trained on source and target justifies the transfer learning.

3. One of the best results for both books-kitchen and kitchen-electronics
   is seen when the pre-trained embeddings are used and the model is trained
   end-to-end first on the source and then on the target. Thus, the
   advantage of learning unsupervised and fine-tuning to adapt to the domain
   shift is very evident.

4. Stacked autoencoders with CNNs show better results than with plain
   DNNs, suggesting the effectiveness of autoencoders in capturing the latent
   features and layered CNNs in capturing the signals for classification.

5. Most of the statistical techniques such as CORAL, CMD, and MMD don’t
   show good performance.

6. Adversarial methods such as DANN show a lot of promise with just
   shallow networks.
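
The transfer loss quoted in the first observation is simply the gap between the accuracy on the source domain and the accuracy on the target test set. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def transfer_loss(source_accuracy, target_accuracy):
    """Accuracy lost when a model trained on the source is tested on the target."""
    return source_accuracy - target_accuracy

books_to_kitchen = transfer_loss(78.25, 69.00)
kitchen_to_electronics = transfer_loss(83.75, 78.00)
```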


11.3.4 Exercises for Readers and Practitioners 


Some other interesting problems readers and practitioners can attempt on their own
include:


1. What will be the impact of combining source and target training data together
   and testing on the unseen target test set?

2. What will be the impact of using labeled and unlabeled data from source and
   target to learn embeddings and then using them with the various techniques? Do
   sentiment-based embeddings give better results than general embeddings?

3. What will be the impact of different embedding techniques learned in Chap. 5 on
   the experiments?

4. What will be the impact of different deep learning frameworks for classification
   that we learned in Chap. 6 on the experiments?

5. What will we see with other domain adaptation techniques such as CycleGAN
   or CoGAN?

6. What will be the transfer loss and improvements on other source-target pairs
   such as DVD-kitchen?

7. Which of these techniques can be employed for speech recognition transfer
   learning problems?


Chapter 12
End-to-End Speech Recognition


12.1 Introduction 


In Chap. 8, we aimed to create an ASR system by dividing the fundamental equation

$$W^* = \arg\max_{W \in V^*} P(W \mid X) \tag{12.1}$$


into an acoustic model, lexicon model, and language model by using Bayes’ the- 
orem. This approach relies heavily on the use of the conditional independence as- 
sumption and separate optimization procedures for the different models. 
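
As a reminder of the decomposition referred to here, the steps can be sketched as follows. This is the standard Bayes argument, written under the usual assumption that $P(X)$ does not depend on $W$ and can therefore be dropped from the maximization:

```latex
\begin{align*}
W^* &= \arg\max_{W \in V^*} P(W \mid X) \\
    &= \arg\max_{W \in V^*} \frac{P(X \mid W)\, P(W)}{P(X)} \\
    &= \arg\max_{W \in V^*} P(X \mid W)\, P(W)
\end{align*}
```

In this form, $P(X \mid W)$ is scored by the acoustic model together with the lexicon, and $P(W)$ by the language model.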

Deep learning was first incorporated into the statistical framework by replacing
the Gaussian mixture models used to predict phonetic states from the observations.
One of the drawbacks of this approach is that the DNN/HMM hybrid models rely on
training each component separately. As seen previously in other scenarios, the
separate training process can lead to sub-optimal results, due to the lack of error
propagation between models. In ASR, these drawbacks tend to manifest themselves as
sensitivity to noise and speaker variation. Applying deep learning for end-to-end
ASR allows the model to learn directly from the data instead of relying on heavily
engineered features. Thus, there have been several approaches to train ASR models in
an end-to-end fashion. End-to-end methods try to optimize the quantity P(W|X)
directly, rather than decomposing it.

With end-to-end modeling, the input-target pair need only be the speech utterance
and the linguistic representation of the transcript. Many representations are
possible: phones, triphones, characters, character n-grams, or words. Given that
ASR focuses on producing word representations from the speech signal, words are
the most obvious choice; however, there are some drawbacks. A large vocabulary
requires a large output layer, as well as examples of each word in training, leading
to much lower accuracies than other representations. More recently, end-to-end
approaches have moved towards using characters, character n-grams, and, given enough
data, some word models as well. These data pairs can be easier to produce,
alleviating the need for linguistic knowledge when creating phonetic dictionaries.
Jointly optimizing the feature extraction and sequential components provides
numerous benefits, specifically lower complexity, faster processing, and higher
quality.

The key component for accomplishing end-to-end ASR is a method of replacing the
HMM to model the temporal structure of speech. The most common methods are CTC
and attention. In this chapter, the components of traditional ASR
are substituted with end-to-end training and decoding techniques. We begin by in- 
troducing CTC, a method for training unaligned sequences. Next, we explore some 
architectures and techniques that have been used to train end-to-end models. We 
then review attention and how to apply it to ASR networks and some of the architec- 
tures that have been trained with these techniques. Following attention, we discuss 
multitask networks trained with both CTC and attention. We explore common de- 
coding techniques for CTC and attention during inference, incorporating language 
models to improve prediction quality. Lastly, we discuss embedding and unsuper- 
vised techniques and then end with a case study, incorporating both a CTC and an 
attention network. 


12.2 Connectionist Temporal Classification (CTC) 


DL-HMM models rely on an alignment of the linguistic units to the audio signal to
train the DNN to classify phonemes, senones, or triphone states (plainly stated,
this sequence of acoustic features should yield this phone). Manually obtaining these
alignments can be prohibitively expensive for large datasets. Ideally, an alignment
would not be necessary for an utterance-transcript pair. Connectionist temporal
classification [Gra+06] was introduced to provide a method of training RNNs to “label
unsegmented sequences directly,” rather than the multistep process in the hybrid
model.

Consider an acoustic input $X = [x_1, x_2, \ldots, x_T]$ of acoustic features with
the desired output sequence $Y = [y_1, y_2, \ldots, y_U]$. An accurate alignment of X
to Y is unknown, and there is additional variability in the ratio of the lengths of X
and Y, typically with $U \le T$ (consider the case where there is a period of silence
in the audio, yielding a shorter transcript).

How can possible alignments be constructed for an (X, Y) pair? A simple theoretical
alignment, as shown in Fig. 12.1, illustrates a potential method where each input
$x_t$ has an output assigned to it, with repeated outputs combined into a single
prediction.

This alignment approach has two problems: First, in speech recognition, the input
may have periods of silence that do not directly align with any output. Second, we
can never have repeated characters (such as the two “l”s in “hello”), because they
are collapsed together into a single prediction.

Input acoustic features X:  x1  x2  x3  x4  x5  x6
Naive alignment:            c   c   a   a   a   t
Output Y:                   c   a   t

Fig. 12.1: Naive alignment for an input X with a length of 6 and an output
Y = [c, a, t]. Example from [Han17]


The CTC algorithm alleviates the issues of this naive approach by introducing a
blank token that acts as a null delimiter. This token is removed after collapsing
repeated predictions, allowing repeated sequences and periods of “silence.” Thus, the
blank token is not included in the loss computation or decoding; however, it allows
a direct alignment between the input and the output without forcing a classification
to the output vocabulary. Note that the blank token is separate from the space token
that is used to denote the separation of words. Figure 12.2 shows an example of the
CTC alignment.


Input Acoustic Features X:    x_1  x_2  ...  x_17
Merge Repeated Predictions:   h e l ε l o ε _ w o ε r l d
Remove Blank Token:           h e l l o _ w o r l d
CTC Predicted Output Y:       h e l l o _ w o r l d


Fig. 12.2: CTC alignment for an input X and output Y = [h,e,l,l,o,_,w,o,r,l,d].
Note that the blank token is represented by “ε” and the space character is represented
by the underscore “_”


With this output representation, there is a 1:1 alignment between the lengths of 
the input sequence and output sequence. Furthermore, the introduction of the blank 
token implies that there can be many predicted alignments that lead to the same 
output. For example: 


[h, e, l, ε, l, o, ε] → “hello”
[h, ε, e, l, ε, l, o] → “hello”
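The collapsing rule behind these examples can be sketched in a few lines of Python. This is only an illustration: the function name `ctc_collapse` and the use of the string "ε" as the blank token are assumptions made for the example, not part of any particular toolkit.

```python
from itertools import groupby

BLANK = "ε"  # assumed symbol for the blank token


def ctc_collapse(alignment, blank=BLANK):
    """Merge repeated predictions first, then remove blank tokens."""
    merged = [token for token, _ in groupby(alignment)]
    return "".join(token for token in merged if token != blank)


# Two different alignments that collapse to the same transcript.
print(ctc_collapse(["h", "e", "l", "ε", "l", "o", "ε"]))
print(ctc_collapse(["h", "ε", "e", "l", "ε", "l", "o"]))
# Without a blank between them, the repeated letters are merged.
print(ctc_collapse(["h", "e", "l", "l", "o"]))
```

The third call shows why the blank token is needed: without it, the double letter cannot survive the merge step.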


Because any token in the output can have an ε before or after, we can imagine
the desired output sequence having an ε before and after each label:


$$Y = [\epsilon, y_1, \epsilon, y_2, \ldots, \epsilon, y_U, \epsilon]$$


Multiple paths/alignments can yield a correct solution, and therefore, all correct
solutions must be considered. The CTC algorithm itself is “alignment-free”; however,
these “pseudo-alignments” are used to compute the probability of possible
alignments.

It then produces an output distribution over all possible Ys, which can be used to 
infer the probability of a particular output, Y. The conditional probability, P(Y |X), 
is computed by summing over all possible alignments between the input and the 
output, as shown in Fig. 12.3. 





Fig. 12.3: Valid CTC paths for the target sequence, Y = [a,b], over the inputs
x_1, ..., x_8. Notice that the blank token, ε, is removed from the final sequence.
Therefore, there are two possible initial states, ε and a, and two possible final
states, ε and b. Additionally, to achieve the final output, the transition from ε
must be to itself or the next token in the sequence, while the transition from a
could be to itself, ε, or b


Mathematically, we can define the conditional probability of a single alignment a
as the product of the probabilities of each state in the sequence:


$$P(a|X) = \prod_{t=1}^{T} P(a_t|X) \tag{12.2}$$


All paths are considered mutually exclusive, so we sum the probability of all
alignments, giving the conditional probability for a single utterance (X,Y):


$$P(Y|X) = \sum_{a \in A_{X,Y}} \prod_{t=1}^{T} P(a_t|X) \tag{12.3}$$


where $A_{X,Y}$ is the set of valid alignments. Dynamic programming is used to make
the computation of the CTC loss function efficient. By supplying blank tokens around
each label in the sequence, the paths can be easily compared and merged when they
reach the same output at the same time step.

Combining everything gives the loss function for CTC.


$$L_{CTC}(X,Y) = -\log \sum_{a \in A_{X,Y}} \prod_{t=1}^{T} P(a_t|X) \tag{12.4}$$


The gradient for backpropagation can be computed for each time step from the
probabilities at each frame.
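The following sketch illustrates how dynamic programming computes the sum in Eq. (12.3) without enumerating alignments, and checks the result against brute-force enumeration on a tiny example. It is a minimal illustration in plain Python, working with raw probabilities rather than log-probabilities; the function names, the choice of index 0 for the blank, and the toy frame probabilities are all assumptions made for the example.

```python
import math
from itertools import product


def collapse(path, blank):
    """Merge repeats, then drop blanks (labels are integer indices)."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out


def ctc_prob_brute_force(probs, target, blank=0):
    """Eq. (12.3) literally: sum P(a|X) over every path that collapses to target."""
    T, S = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(S), repeat=T):
        if collapse(path, blank) == list(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


def ctc_prob_forward(probs, target, blank=0):
    """Same quantity via the forward recursion over the blank-extended target."""
    ext = [blank]
    for label in target:
        ext += [label, blank]              # [ε, y_1, ε, ..., y_U, ε]
    T, L = len(probs), len(ext)
    alpha = [0.0] * L
    alpha[0] = probs[0][ext[0]]            # start in the leading blank ...
    if L > 1:
        alpha[1] = probs[0][ext[1]]        # ... or in the first label
    for t in range(1, T):
        new = [0.0] * L
        for s in range(L):
            total = alpha[s]               # stay in the same state
            if s >= 1:
                total += alpha[s - 1]      # advance by one state
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    return alpha[-1] + (alpha[-2] if L > 1 else 0.0)


def ctc_loss(probs, target, blank=0):
    """Eq. (12.4): negative log of the summed alignment probability."""
    return -math.log(ctc_prob_forward(probs, target, blank))


# Toy example: 3 frames, classes {0: blank, 1: "a", 2: "b"}, target "ab".
frames = [[0.2, 0.5, 0.3], [0.3, 0.3, 0.4], [0.5, 0.1, 0.4]]
print(ctc_prob_forward(frames, [1, 2]), ctc_prob_brute_force(frames, [1, 2]))
```

The brute-force version visits every one of the S^T paths, while the forward recursion touches each (time step, extended-label) pair once, which is what makes training on long utterances feasible.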

CTC assumes conditional independence between each time step in that the output 
at each time step is independent of the previous time steps. Although this property 
allows for frame-wise gradient propagation, it limits the ability to learn sequential 
dependencies. Using a language model (Sect. 12.5.2) alleviates some of the issues, 
by providing a word or n-gram context. 


12.2.1 End-to-End Phoneme Recognition 


CTC was initially successful on the TIMIT [ZSG90] phoneme recognition task 
[GMH13]. Various architectures, trained with CTC, were explored yielding state-of- 
the-art performance on the task. The architecture mapped Mel filter-bank features 
to the phonetic sequence with a single end-to-end network. The authors explored 
unidirectional and bidirectional RNNs. A stacked, bidirectional LSTM architecture 
provided the best results. Bidirectional RNNs seemed to allow the network to lever- 
age the context of the whole utterance, rather than the forward only context. 

The authors used two regularization techniques in the training of this network: 
weight noise and early stopping. Weight noise adds Gaussian noise to the weights 
during training to reduce overfitting to specific sequences. These regularization tech- 
niques turned out to be crucial to the training of the network. 


12.2.2 Deep Speech 


Following the success of CTC in phoneme recognition, others attempted to use 
it with different output representations. The Deep Speech (DS1) architecture 
[Han+14a] was trained to predict a sequence of character probabilities to pro- 
duce a transcript directly from the audio features (in this case the spectrogram). The 
Deep Speech network consisted of a DNN architecture with three fully connected 
layers, one bidirectional LSTM layer, which took the place of the HMM, and a 
fully connected output softmax layer that classifies the predictions as one of the 
characters in the alphabet. The input layer relied on frames from the spectrogram, a
central frame with a set of 5-9 context frames on each side. An illustration of this
architecture is shown in Fig. 12.4. 
Fig. 12.4: RNN model used in the original Deep Speech paper. The architecture
incorporates a single bidirectional LSTM layer after three fully connected layers
that learn features on the input spectrogram


Given the complexity of the end-to-end mapping to characters, a significant
component of Deep Speech’s success was the size of the dataset: 5000 h from 9600
speakers. Despite the increase in the size of the dataset, regularization is still essential
to the generalization of the network, so the models were trained with dropout as 
well as data augmentation. One technique inspired by “jittering” in computer vision 
was leveraged, translating the audio file by 5 ms forward and backward. The output 
probabilities for the jittered examples are averaged before backpropagation. 

One of the exciting components of the Deep Speech work is that the RNN model 
can learn a light character-level language model during the training procedure, pro- 
ducing “readable” transcripts even without a language model. The errors that appear 
tend to be phonetic misspellings of words, such as bostin instead of boston. 


12.2.2.1 GPU Parallelism 


Given the size of the dataset and computational requirements of the architecture, 
multiple GPUs were needed to facilitate training. The Deep Speech work was pivotal 
in overcoming many engineering challenges, such as how to train on large datasets. 
Many contributions of the paper focused on scaling the training of the architecture 
on multiple GPUs. Two types of parallelism were used to train the models across 
multiple GPUs: data parallelism and model parallelism. Data parallelism focuses on 
retaining a copy of the architecture on each GPU, splitting a large training batch 
across the separate GPUs, performing the forward and backward propagation steps
on the separate data, and finally aggregating the gradient updates for all of the
models. Data parallelism provides near-linear scaling with the number of GPUs (although
it may affect the convergence rate, due to the larger effective batch size). The second type of
parallelism is model parallelism. Model parallelism focuses on splitting the model’s 
layers and distributing the layers across the set of available GPUs. Incorporating 
model parallelism can be difficult when working with recurrent neural networks, 
due to their sequential nature. In the Deep Speech architecture, the authors achieved 
model parallelism by splitting the model in half along the time dimension. These de- 
cisions allowed the authors to train on 5000h of audio and achieve state-of-the-art 
results on two noisy speech benchmarks. 
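The data-parallel update described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Deep Speech implementation: the "model" is a single weight fit with mean squared error, the workers run sequentially instead of on separate GPUs, and the batch is assumed to split evenly across workers.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y ≈ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_workers):
    """Every simulated worker holds the same w and an equal shard of the batch."""
    shard = len(batch) // num_workers      # assumes len(batch) % num_workers == 0
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]
    local = [mse_gradient(w, s) for s in shards]   # one per GPU in practice
    return sum(local) / num_workers                # aggregate (average) the gradients


batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
print(mse_gradient(0.5, batch), data_parallel_gradient(0.5, batch, num_workers=2))
```

With equal-sized shards, the averaged per-worker gradient matches the full-batch gradient, which is why the aggregate step preserves the update while the forward and backward passes run in parallel.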


12.2.3 Deep Speech 2 


In Deep Speech 2 (DS2) [Amo+16], a follow-on paper to Deep Speech, the authors 
extended the original architecture to perform character-based, end-to-end speech 
recognition. The authors validated the modeling techniques on both English and 
Mandarin Chinese transcription. The Deep Speech 2 modifications introduced many 
improvements to the original architecture, as well as engineering optimizations 
achieving a 7x speedup over the original Deep Speech implementation. Figure 12.5
shows the updated architecture.







Fig. 12.5: The Deep Speech 2 architecture incorporated convolutional layers that 
learned features from utterance spectrograms and significantly increased the depth 

The main difference between the Deep Speech and Deep Speech 2 architectures is the
increase in depth. In the Deep Speech 2 work, many different architectures were
explored, varying the number of convolutional layers between 1 and 3 and the
number of recurrent layers from 1 to 7. The optimal DS2 architecture for English
transcription included 11 layers (3 convolutional, 7 bidirectional recurrent, and 1 fully
connected layer). Batch normalization is incorporated after each layer (apart from 
the fully connected layer), and gradient clipping was also included to improve con- 
vergence. The overall architecture contained approximately 35 million parameters. 
With the incorporation of an n-gram language model, this leads to a 43.4% relative 
improvement in WER over the already competitive DS1 architecture. 

Other key components in the improvements of Deep Speech 2 were training tech- 
niques and further increasing the dataset size. Training can be unstable in the early 
stage of CTC models. The authors use a training curriculum to improve the stability 
of the model when training. By first selecting the shorter utterances, the model can 
benefit from smaller gradient updates in the earlier part of the first epoch. Addition- 
ally, the authors increased the size of the dataset to 12,000 h in Deep Speech 2. They 
note that scaling the data decreases the WER by 40% for each factor of 10 increase 
in the training set size. 
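A minimal sketch of such a length-based curriculum is shown below. The record layout (`num_frames`), the function name, and the decision to shuffle in every epoch after the first are assumptions made for the illustration; the text above only states that shorter utterances are selected first in the first epoch.

```python
import random


def epoch_order(utterances, epoch, seed=0):
    """Return the utterances in the order they are visited in a given epoch."""
    if epoch == 0:
        # First epoch: shortest utterances first.
        return sorted(utterances, key=lambda u: u["num_frames"])
    # Assumed behavior for later epochs: ordinary random order.
    shuffled = list(utterances)
    random.Random(seed + epoch).shuffle(shuffled)
    return shuffled


utterances = [
    {"id": "utt-a", "num_frames": 300},
    {"id": "utt-b", "num_frames": 120},
    {"id": "utt-c", "num_frames": 210},
]
print([u["id"] for u in epoch_order(utterances, epoch=0)])
```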


12.2.4 Wav2Letter 


Wav2Letter [CPS16] extends end-to-end models to CNN-only networks. This work 
showed competitive results to other end-to-end networks, such as Deep Speech 2, 
with a fully convolutional network operating on MFCCs and power-spectrum fea- 
tures. The CNN was trained with CTC, and achieved significant increases in speed, 
with the capability of producing real-time decoding. 

After training the network, intermediate 1-D convolution layers are added be- 
tween the input and initial convolution layers. The input to the network was then 
changed to the raw waveform, with the aim of learning to produce features similar 
to the MFCCs used initially. After training these layers, the whole network is trained 
jointly for end-to-end optimization. The end-to-end network operating on the raw 
waveform showed a modest degradation in accuracy, even though it operated di- 
rectly on the waveform. Figure 12.6 shows the proposed architecture. 

The authors also explored a novel sequence loss function called the automatic 
segmentation criterion (ASG) in addition to CTC. ASG has no blank label, no nor- 
malized scores on the nodes, and global normalization instead of frame-level nor- 
malization. We may recall that we use the blank character to delimit double letters. 
Instead, ASG incorporates an additional character specifically for repetition (e.g.,
“hello” could be represented as “hel2o”).
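The repetition encoding can be sketched as a simple run-length rewrite of the transcript. The function name is hypothetical, and writing the run length as a digit follows the “hel2o” example above rather than any specific Wav2Letter API.

```python
from itertools import groupby


def encode_repetitions(word):
    """Replace a run of n identical letters with the letter followed by n."""
    pieces = []
    for letter, run in groupby(word):
        count = len(list(run))
        pieces.append(letter if count == 1 else f"{letter}{count}")
    return "".join(pieces)


print(encode_repetitions("hello"))
```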

The removal of RNNs from the architecture makes predictions much less
computationally costly, as well as allowing for streaming transcription (the convolutions
stride across the input to reveal the output at each time step). In the follow-up work
on Wav2Letter++ [Pra+18], the authors improved the speed of the ASR system,
achieving linear scaling for training times (up to 64 GPUs).


Fig. 12.6: Wav2Letter architecture for recognition on a raw waveform. The first
layer is not included when training on MFCCs instead of the raw waveform. The
convolutional parameters are organized as (kw, dw, dim ratio), where kw is the
kernel width, dw is the kernel stride, and dim ratio is the ratio of the number of
input dimensions to the number of output dimensions


12.2.5 Extensions of CTC 


CTC provides an elegant way to compute pseudo-alignments for unaligned se- 
quences; however, the frame independence assumption does have drawbacks. Vari- 
ous techniques have been introduced to relax the frame independence assumption. 
The most notable are Gram-CTC and the RNN transducer. 


12.2.5.1 Gram-CTC


Gram-CTC [Liu+17] extended the CTC algorithm to address the fixed alphabet and
fixed target decomposition. This approach focused on learning to predict
character n-grams rather than single characters, allowing the model to output multiple
characters at a given time step. Using character n-grams can mildly alleviate the
effects of the frame independence assumption, due to the need to learn multiple labels
together.

The work also experimented with automatically learning the character n-grams 
(referred to as “grams”) during the training process, leveraging the forward-backward
algorithm. Although it is feasible for both the grams and transcription to
be learned jointly, the model needs to learn the alignment and decomposition of the 
target in tandem, and the training becomes unstable. Multitask learning is used to 
combat this instability by jointly optimizing CTC as well as Gram-CTC. Overall, 
the incorporation of grams resulted in improvements across multiple datasets, even 
when the grams were manually selected. 


12.2.5.2 RNN Transducer 


The RNN transducer (RNN-T) [Gra12] extends CTC by assuming a local and monotonic
alignment between the input and output sequences. This approach alleviates the
conditional independence assumption of CTC by incorporating two RNN layers
that model the dependencies between outputs at different time steps.


$$P_{RNN-T}(Y|X) = \sum_{a \in A_{X,Y}} P(a|h) = \sum_{a \in A_{X,Y}} \prod_{t=1}^{T'} P(a_t | h_t, y_{1:u_t-1}) \tag{12.5}$$


where $u_t$ signifies the output time step aligned to the input time step t. $T'$ is the length
of the alignment sequence including the number of blank labels predicted. Note that
$y_{1:u}$ is the sequence of predictions excluding blanks up to time step u. The RNN
incorporates the full history of the non-blank labels into the CTC prediction at the
next time step. Training the RNN-T model requires using the forward-backward
algorithm to compute the gradients (similar to the CTC computation). In online
speech recognition, a unidirectional RNN can be used to model the dependencies
between time steps in the forward direction.


12.3 Seq-to-Seq 


The success of sequence-to-sequence models in machine translation prompted their 
application in speech recognition. One of the most significant benefits of seq-to-seq 
models in speech recognition is that they do not rely on CTC for training, natively 
alleviating the frame independence assumption of CTC. Typically in speech recog- 
nition, there are a large number of time steps in the input and output that make it 
infeasible to train basic seq-to-seq models with a single hidden state representing 
the full utterance. 
Instead, the attention-based approach is used and can model the probability of
the output sequence directly:


$$P(Y|X) = \prod_{u=1}^{U} P(y_u | y_{1:u-1}, X) \tag{12.6}$$

This quantity can be estimated by the attention-based objective function from
[Bah+16c]:

$$\begin{aligned}
h_t &= \text{Encoder}(X) \\
a_{ut} &= \begin{cases} \text{ContentAttention}(q_{u-1}, h_t) \\ \text{LocationAttention}(\{a_{u-1}\}_{t=1}^{T}, q_{u-1}, h_t) \end{cases} \\
c_u &= \sum_{t=1}^{T} a_{ut} h_t \\
P(y_u | y_{1:u-1}, X) &= \text{Decoder}(c_u, q_{u-1}, y_{u-1})
\end{aligned} \tag{12.7}$$


The encoder neural network produces a hidden representation $h_t$ of the acoustic
input, and the decoder produces the transcript output from the encoded sequence. The
attention weight, $a_{ut}$, is used to compute the context vector $c_u$ for the decoder. The
decoder hidden state, $q_u$, carries the cumulative context of the decoder’s
predictions into the next prediction. We consider two types of attention here: content-based
and location-aware attention [Cho+15c].


12.3.0.1 Content-Based Attention 


Content-based attention learns a weight vector g and two linear layers, W and V
(without bias parameters), to weigh the previous decoder hidden state $q_{u-1}$ and the
encoder hidden state $h_t$. This is represented as follows:


$$e_{ut} = g^\top \tanh(W q_{u-1} + V h_t) \tag{12.8}$$


$$a_{ut} = \text{Softmax}(\{e_{ut}\}_{t=1}^{T}) \tag{12.9}$$
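To make Eqs. (12.8) and (12.9) concrete, the sketch below computes content-based attention weights and the resulting context vector with plain Python lists. The helper names and the toy parameter values are assumptions chosen for illustration; in a real model W, V, and g are learned.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(scores):
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def content_attention(q_prev, H, W, V, g):
    """Return (attention weights over time, context vector) for one decoder step."""
    Wq = matvec(W, q_prev)
    energies = []
    for h_t in H:
        hidden = [math.tanh(a + b) for a, b in zip(Wq, matvec(V, h_t))]
        energies.append(sum(gi * hi for gi, hi in zip(g, hidden)))   # e_ut
    weights = softmax(energies)                                      # a_ut
    dim = len(H[0])
    context = [sum(w * h_t[d] for w, h_t in zip(weights, H)) for d in range(dim)]
    return weights, context


# Toy values: two encoder states of dimension 2 and a zero decoder state.
H = [[0.0, 0.0], [2.0, 0.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
g = [1.0, 0.0]
weights, context = content_attention([0.0, 0.0], H, W, V, g)
print(weights, context)
```

The context vector is a convex combination of the encoder states, so the decoder sees a soft selection of the frames that score highest for the current decoder state.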


12.3.0.2 Location-Aware Attention 


Location-aware attention extends content-based attention with convolutional
features that account for the alignment at the previous step. These features can be
defined as:


$$\{f_t\}_{t=1}^{T} = K * a_{u-1} \tag{12.10}$$


where * represents the one-dimensional convolution operation over the time axis
t with convolutional matrix K. A linear layer U is also learned to map the output
features $f_t$ into the feature space.

$$e_{ut} = g^\top \tanh(W q_{u-1} + V h_t + U f_t) \tag{12.11}$$


$$a_{ut} = \text{Softmax}(\{e_{ut}\}_{t=1}^{T}) \tag{12.12}$$


One of the difficulties of training attention-based networks is the simultaneous 
optimization of: 


e the encoder weights, 
e the attention mechanism for computing the correct alignment, and 
e the decoder weights. 


These coupled dynamics make training difficult, especially in the early stages,
and regularization is a key component for these models.


12.3.1 Early Seq-to-Seq ASR 


Attention was successfully applied to machine translation in [BCB14a], which
extended earlier work in computer vision [MHG+14] to the RNN encoder-decoder
from [Cho+14].

[Bah+16c] applied seq-to-seq to speech recognition. The attention mechanism in
this work focused the decoder on a range of the encoder outputs. Attention not only
helped the convergence of the model but also improved the training time (Figs. 12.7,
12.8 and 12.9).


12.3.2 Listen, Attend, and Spell (LAS) 


The listen, attend, and spell (LAS) network [Cha+16b] used a pyramid BiLSTM to 
encode the input sequence, referred to as the listener. The decoder was an attention- 
based RNN to predict characters. 

The drawback for seq-to-seq models is that they tend to be more difficult to train 
(more so than CTC) and slower during inference. The decoder cannot predict until 
the attention mechanism has weighed all of the previous hidden states for each new 
time step. Some techniques have been introduced to deal with this, such as window- 
ing mechanisms to reduce the number of time steps considered during decoding and 
label smoothing, which prevents overconfidence in predictions.

One of the other difficulties for seq-to-seq models is that they cannot be used in a 
full online streaming fashion. The entire context must be encoded before decoding 
can begin. 

In [VDO+16], the Wav2Text architecture used a CNN-RNN model with
attention to predict character-based transcripts directly on the raw waveform. The
encoder is a convolutional architecture combined with two bidirectional LSTMs,
and the decoder is a single bidirectional LSTM. The convolutional layers are used
mainly to reduce the dimensionality of the input. Due to the additional
complexity of attention and the utilization of the raw waveform, the network was trained via
transfer learning. Initially, only the lower encoder layers are trained to predict the
spectral features (MFCC and log Mel-scale spectrogram) as the target from the raw
input waveform. The network is then trained with these features through the
attention-based encoder-decoder with CTC to produce a transcript.


Fig. 12.7: Attention-based end-to-end ASR model from [Bah+16c]


12.4 Multitask Learning 


Many of the drawbacks of attention and CTC led to multitask learning approaches. 
Attention usually performs better in end-to-end scenarios; however, it typically has 
difficulties converging and tends to suffer in noisy environments. CTC, on the other 
hand, usually yields lower quality due to the conditional independence assumption, 
but is more stable. The trade-offs between CTC and attention make their
combination highly valuable via multitask learning. ESPnet [KHW17, Xia+18] was trained
to do just this: jointly optimizing an attention-based encoder-decoder model with
both the CTC and attention objectives.
The training loss for ESPnet is a multi-objective loss (MOL) defined as:


$$L_{MOL} = \lambda \log P_{ctc}(C|X) + (1 - \lambda) \log P^{*}_{att}(C|X) \tag{12.13}$$

where λ is the weight for each loss function and 0 ≤ λ ≤ 1. $P_{ctc}$ is the CTC objective
and $P^{*}_{att}$ is the attention objective.


Fig. 12.8: Attention-based end-to-end ASR model from [Cha+16b], using
pyramid LSTMs in the encoder


Fig. 12.9: End-to-end speech processing network from [KHW17]
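The interpolation in Eq. (12.13) can be written as a one-line function. This is an illustrative sketch only: the function name is hypothetical, the two log-probabilities are made-up numbers standing in for the outputs of the CTC and attention branches, and a training loop would typically minimize the negative of this quantity.

```python
def multi_objective(log_p_ctc, log_p_att, lam):
    """Interpolate the CTC and attention log-likelihoods with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * log_p_ctc + (1.0 - lam) * log_p_att


# Hypothetical log-likelihoods of one transcript under each branch.
log_p_ctc, log_p_att = -12.0, -8.0
for lam in (0.0, 0.5, 1.0):
    print(lam, multi_objective(log_p_ctc, log_p_att, lam))
```

Setting the weight to 0 or 1 recovers a pure attention or a pure CTC objective, so the weight controls how strongly the more stable CTC branch regularizes the attention branch.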

The ESPnet architecture uses a 4-layer bidirectional LSTM encoder and a 1-layer 
LSTM decoder. To reduce the number of time steps of the output, the top two layers 
of the encoder read every second state, which reduces the length of the output h by 
a factor of 4. 


12.5 End-to-End Decoding 


CTC and attention-based models are end-to-end, producing a transcript directly 
from the acoustic features. Although they have the capability of learning inherent 
language models during training, the amount of language data seen during training 
is relatively small. In most circumstances, the decoding procedure can improve the
predictions, in many cases significantly improving word error rates. Ideally,
additional information is incorporated during the decoding process to
improve predictions, using a beam search and a language model. A beam search can
incorporate a broader context into the predictions, while language models can take 
advantage of large text corpora that may not have utterance-transcript pairings. 

In [Hor+17], two methods are introduced for decoding a combined CTC-attention
model. The first approach rescores the predictions and the second method
does one-pass decoding incorporating the probabilities from each of the attention 
and CTC predictions. 

In [HCW18], the authors incorporate word and character-based RNN language 
models into the decoding procedure. 


12.5.1 Language Models for ASR 


The decoding process can be extended by providing a prior of the language in the 
form of a language model. These language models can be trained on large volumes 
of text data to accurately bias predicted transcripts to particular domains. 


12.5.1.1 N-gram 


In the Deep Speech 2 paper, the authors experimented with n-gram language mod- 
els. Although the RNN layers included in the architecture learn an implicit language 
model, it tends to err on homophones and the spelling of certain words. Therefore, an
n-gram language model was trained using the KenLM [Hea+13] toolkit on the
Common Crawl Repository,¹ using the 400,000 most frequent words from 250 million
lines of text.


¹ http://commoncrawl.org.

The decoding step uses a beam search to optimize the quantity:


$$Q(Y) = \log(P_{CTC}(Y|X)) + \alpha \log(P_{LM}(Y)) + \beta \gamma(Y) \tag{12.14}$$


where γ(Y) is the number of words in Y. The weight α affects the contribution of
the language model, and the weight β biases predictions to have more words. Both
parameters are tuned on a development set.

The language model was incorporated into the beam search decoding and signif- 
icantly improved the base WER over the no-language model baselines. 
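The effect of the language model term in Eq. (12.14) can be seen by rescoring two candidate transcripts. The sketch below is illustrative: the candidate strings, their log-probabilities, and the weight values are invented for the example, and the word count is taken by splitting on whitespace.

```python
def beam_score(log_p_ctc, log_p_lm, transcript, alpha, beta):
    """Q(Y) = log P_CTC(Y|X) + alpha * log P_LM(Y) + beta * (number of words in Y)."""
    return log_p_ctc + alpha * log_p_lm + beta * len(transcript.split())


# Hypothetical candidates: the first is acoustically slightly better,
# the second is far more plausible under the language model.
candidates = [
    ("eye saw the see", -4.0, -12.0),
    ("i saw the sea", -4.2, -7.0),
]

for alpha in (0.0, 0.5):
    best = max(candidates, key=lambda c: beam_score(c[1], c[2], c[0], alpha, beta=1.0))
    print(alpha, best[0])
```

With the language model weight at zero the acoustically preferred homophone wins; once the language model contributes, the ranking flips, which is the behavior the weights are tuned for on the development set.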


12.5.1.2 RNN Language Models 


RNN language models have surfaced various times in this book. The application of 
RNN language models relies on utilizing the likelihood of the next word to predict 
the most likely sequence of words given the previous words.

These models can be incorporated as an additional score during the beam decod- 
ing in the same way as the n-gram language model or as a rescoring of the top n 
hypotheses. 

Word-based models suffer from the OOV issue, but they have successfully beaten 
phoneme-based CTC models when trained on very large datasets (125 kh) [SLS16]. 
This limitation has prompted research on incorporating character-based prediction 
when encountering OOV terms [Li+17]. 


12.5.2 CTC Decoding 


Decoding a CTC network (a deep learning network trained with CTC) refers to
finding the most probable output for the classifier at inference time, similar in spirit to
HMM decoding. Mathematically, the decoding process is described by the function
h(x):

$$h(x) = \operatorname*{argmax}_{l \in L^{\leq T}} P(l|x) \tag{12.15}$$

In the original connectionist temporal classification publication [Gra+06], two 
methods were proposed: best path decoding and prefix search decoding. 

Best path decoding, also known as greedy decoding, outputs the most probable
output at each time step. To obtain a useful string, repeated characters are then
collapsed and the blank token is removed to obtain the hypothesis, h.


$$\begin{aligned}
h(x) &= B(\pi^*) \\
\pi^* &= \operatorname*{argmax}_{\pi \in N^t} p(\pi|x)
\end{aligned} \tag{12.16}$$


This decoding scheme is straightforward. However, it is not likely to produce the
best sequence because it does not consider the multiple paths that yield the same
output.
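A small example makes the weakness of best path decoding visible. The sketch below is a minimal illustration with assumed names and toy probabilities: class 0 is the blank and class 1 is the label "a".

```python
def best_path_decode(probs, blank=0):
    """Pick the most probable class per frame, merge repeats, drop blanks."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    labels, prev = [], None
    for s in path:
        if s != prev and s != blank:
            labels.append(s)
        prev = s
    return labels


# Two frames, each with P(blank) = 0.6 and P("a") = 0.4.
frames = [[0.6, 0.4], [0.6, 0.4]]
print(best_path_decode(frames))
```

Here the greedy path is blank-blank, so the decoded output is empty with path probability 0.6 × 0.6 = 0.36. The output "a", however, is reached by three paths (blank-a, a-blank, and a-a) whose probabilities sum to 0.24 + 0.24 + 0.16 = 0.64, so the most probable output is not the one that greedy decoding returns.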

A beam search can be incorporated into the decoding process to improve
prediction. With the beam search, the probabilities of paths leading to the same result
can be summed, yielding a higher probability for that result. Algorithm 1 shows the
beam search decoding process with ∅ representing the empty sequence and the set
of beams, B. Here P⁻ and P⁺ denote the probabilities of a sequence ending in a blank
and in a non-blank label, respectively, with P = P⁻ + P⁺; Y^e is the final label of Y,
and Ŷ is Y with its final label removed.


Algorithm 1: CTC beam search
Input: B ← {∅}; P⁻(∅, 0) ← 1
Result: max_{Y ∈ B} P^{1/|Y|}(Y, T)
begin
    for t = 1...T do
        B̂ ← the W most probable sequences in B
        B ← {}
        for Y ∈ B̂ do
            if Y ≠ ∅ then
                P⁺(Y, t) ← P⁺(Y, t-1) P(Y^e, t|x)
                if Ŷ ∈ B̂ then
                    P⁺(Y, t) ← P⁺(Y, t) + P(Y^e, Ŷ, t)
            P⁻(Y, t) ← P(Y, t-1) P(-, t|x)
            Add Y to B
            for k = 1...K do
                P⁻(Y + k, t) ← 0
                P⁺(Y + k, t) ← P(k, Y, t)
                Add (Y + k) to B


The beam search algorithm can be extended with an n-gram language model. A 
simple approach is to rescore the word sequence each time an end-of-word (space) 
token is reached. However, this relies on the model to predict full words with no 
misspellings. 

A better approach is to use prefix search decoding, which incorporates the
subword-level information during the decoding process, utilizing the prefixes of the
language model. Converting a word-level language model to a “label-level” or
character-based model is accomplished by representing the output sequence as the
concatenation of the longest completed word sequence and the remaining word
prefix, denoted as W and p, respectively. The function for computing the probability of
the next label given the current sequence becomes:


$$P(k|Y) = \frac{\sum_{w' \in (p+k)^*} P^{\gamma}(w'|W)}{\sum_{w' \in p^*} P^{\gamma}(w'|W)} \tag{12.17}$$


where P(w′|W) is the probability of the word history transition from W to w′, p* is
the set of dictionary words prefixed by p, and γ is the language model weight.

During decoding, the probabilities of sequence prefixes are computed, with the 
option to end the current prefix or continue extending it. During the beam search, the 
probability of a hypothesis state is modified to also depend on the probability of a 
prefix, dictionary entry, or n-gram language model when determining the extension 
probability. 

This method relies on the forward-backward algorithm, where the computation
grows exponentially with the number of states and time steps. We can improve the
efficiency of the decoding by pruning the output sequence, removing all outputs
where the probability of the blank token is above a specified threshold. Because the
output activations tend to be “peaky,” this dramatically reduces the number of states
considered and consistently outperforms best path decoding.

This algorithm can be used without a language model by setting the probabilities 
to 1. The prefix algorithm presented in [Han+14b] is given in Algorithm 2. 


Algorithm 2: CTC prefix beam search


Input: P_b(∅; x_{1:0}) ← 1, P_nb(∅; x_{1:0}) ← 0
       A_prev ← {∅}
Result: most probable prefix in A_prev
begin
    for t = 1...T do
        A_next ← {}
        for ℓ ∈ A_prev do
            for c ∈ Σ do
                if c = blank then
                    P_b(ℓ; x_{1:t}) ← P(blank; x_t) (P_b(ℓ; x_{1:t-1}) + P_nb(ℓ; x_{1:t-1}))
                    add ℓ to A_next
                else
                    ℓ⁺ ← concatenate ℓ and c
                    if c = ℓ_end then
                        P_nb(ℓ⁺; x_{1:t}) ← P(c; x_t) P_b(ℓ; x_{1:t-1})
                        P_nb(ℓ; x_{1:t}) ← P(c; x_t) P_nb(ℓ; x_{1:t-1})
                    else if c = space then
                        P_nb(ℓ⁺; x_{1:t}) ← P(W(ℓ⁺)|W(ℓ))^α P(c; x_t) (P_b(ℓ; x_{1:t-1}) + P_nb(ℓ; x_{1:t-1}))
                    else
                        P_nb(ℓ⁺; x_{1:t}) ← P(c; x_t) (P_b(ℓ; x_{1:t-1}) + P_nb(ℓ; x_{1:t-1}))
                    if ℓ⁺ not in A_prev then
                        P_b(ℓ⁺; x_{1:t}) ← P(blank; x_t) (P_b(ℓ⁺; x_{1:t-1}) + P_nb(ℓ⁺; x_{1:t-1}))
                        P_nb(ℓ⁺; x_{1:t}) ← P(c; x_t) P_nb(ℓ⁺; x_{1:t-1})
                    add ℓ⁺ to A_next
        A_prev ← k most probable prefixes in A_next

This approach also requires length normalization, to prevent a bias towards
sequences with fewer transitions.
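The sketch below is a compact Python rendering of prefix beam search without a language model (that is, with all language model probabilities set to 1). It is an illustration under stated assumptions rather than the implementation from [Han+14b]: probabilities are kept in the linear domain, contributions are accumulated in a dictionary instead of following the exact assignment order of Algorithm 2, length normalization is omitted, and the function name and toy input are invented for the example.

```python
from collections import defaultdict


def prefix_beam_search(probs, beam_size, blank=0):
    """Return (best prefix, probability); probs[t][c] is P(c; x_t)."""
    # Each prefix maps to (P_b, P_nb): probability of ending in blank / non-blank.
    beam = [((), (1.0, 0.0))]
    for frame in probs:
        next_beam = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beam:
            for c, p in enumerate(frame):
                if c == blank:
                    # A blank keeps the prefix unchanged.
                    next_beam[prefix][0] += p * (p_b + p_nb)
                    continue
                extended = prefix + (c,)
                if prefix and c == prefix[-1]:
                    # Repeated label: extends only after a blank, otherwise it merges.
                    next_beam[extended][1] += p * p_b
                    next_beam[prefix][1] += p * p_nb
                else:
                    next_beam[extended][1] += p * (p_b + p_nb)
        ranked = sorted(next_beam.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beam = [(prefix, tuple(ps)) for prefix, ps in ranked[:beam_size]]
    best_prefix, (p_b, p_nb) = beam[0]
    return best_prefix, p_b + p_nb


# Two frames with P(blank) = 0.6 and P(label 1) = 0.4 in each frame.
frames = [[0.6, 0.4], [0.6, 0.4]]
print(prefix_beam_search(frames, beam_size=4))
```

On this toy input the search returns the one-label prefix, because the probabilities of the three paths that collapse to it are summed, whereas picking the most probable class per frame would return the empty sequence.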


12.5.3 Attention Decoding 


An attention decoder produces a distribution over the next symbol given the previous
predictions. Therefore, as seen previously, greedy decoding could be applied here,
yielding the most probable character at each time step. However, it likely would not
yield the most probable sequence $\hat{C}$.


$$\hat{C} = \operatorname*{argmax}_{C \in U^*} \log P(C|X) \tag{12.18}$$


A beam search can also be applied to attention models during the decoding process.
Because the previous time step is provided as an input to the next prediction, the top
n most probable paths can be retained at each time step. The beam
search begins by first considering the start of sentence symbol, <s>.


$$\alpha(h, X) = \alpha(g, X) + \log P(c | g_{l-1}, X) \tag{12.19}$$


where g is a partial hypothesis in the beam, and c is a symbol/character appended
to g, yielding a new hypothesis h. An example of beam search attention decoding is
shown in Fig. 12.10.
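The score update in Eq. (12.19) can be sketched with a toy conditional model. Everything specific here is an assumption for illustration: the decoder is replaced by a lookup table of made-up probabilities over a two-character alphabet, hypotheses have a fixed length of two, and end-of-sentence handling is left out.

```python
import math


def beam_search(step_log_prob, alphabet, length, beam_size):
    """Keep the beam_size best partial hypotheses g; extend each with every c."""
    beam = [("", 0.0)]                       # (hypothesis, accumulated log score)
    for _ in range(length):
        candidates = [
            (g + c, score + step_log_prob(g, c))   # alpha(h) = alpha(g) + log P(c|g, X)
            for g, score in beam
            for c in alphabet
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


# Toy stand-in for the decoder: P(next symbol | prefix).
TABLE = {
    "": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.9, "b": 0.1},
}


def toy_model(prefix, symbol):
    return math.log(TABLE[prefix][symbol])


greedy = beam_search(toy_model, "ab", length=2, beam_size=1)
wider = beam_search(toy_model, "ab", length=2, beam_size=2)
print(greedy[0], math.exp(greedy[1]))
print(wider[0], math.exp(wider[1]))
```

With a beam of one (greedy decoding) the search commits to "a" first and can reach at most probability 0.6 × 0.5 = 0.30, while a beam of two keeps "b" alive and finds "ba" with probability 0.4 × 0.9 = 0.36.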

Various architectures have aimed to use additional unpaired text data in
end-to-end ASR models [Tos+18]. The term fusion has recently been coined, referring
to the integration of these language models into the main acoustic model.


12.5.3.1 Shallow Fusion 


Shallow fusion (used originally for NMT) combines the scores of the LM and ASR
models during the decoding [Gul+15]. This type of decoding incorporates an
external language model during the beam search so that word or character
probabilities are taken into consideration. Shallow fusion can be used with
word- or character-based language models to determine the probability of a
particular sequence.

$$Y^* = \operatorname*{argmax}_{Y} \log P(Y|X) + \lambda \log P_{LM}(Y) \tag{12.20}$$


Character language models are helpful for rescoring hypotheses before a word 
boundary is reached or as a rescoring mechanism for character-based languages, 
such as Japanese and Mandarin Chinese. Additionally, character-based language 
models can predict unseen character sequences, which a word-based model would 
not allow. 

Fig. 12.10: Beam search decoding example with a beam size of 2 on a
three-character alphabet {a,b,c}. With attention decoding, the previous time step is
incorporated into the next character prediction. Therefore, the probabilities are dependent
on the path chosen. The best paths at each time step are highlighted, with the darker
one being the top prediction. Note how the greedy decoding of this example would
yield a sub-optimal result


Shallow fusion has also been incorporated into RNN-T models, which alleviate the
frame independence assumption of CTC while also incorporating the language
model bias into the prediction [He+18].


12.5.4 Combined Language Model Training 


When incorporating neural language models into end-to-end ASR, it is quickly ap- 
parent that the two could be optimized together, leveraging the acoustic information 
as well as the linguistic information from large text corpora. The two most popular 
techniques for jointly training the acoustic and language models are deep fusion and 
cold fusion. 


12.5.4.1 Deep Fusion 


Deep fusion [Gul+15], on the other hand, incorporates the LM into the acoustic
model (specifically an encoder–decoder model), creating a combined network.
The networks are combined by “fusing” the hidden states of the pre-trained
AM and LM, and training continues in order to learn the “fused” parameters. During
this training procedure, the LM and AM parameters are fixed, which reduces
computation costs and lets training converge quickly.


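$$
\begin{aligned}
g_t &= \sigma(v_g^\top s_t^{LM} + b_g) \\
s_t^{DF} &= [c_t;\, s_t;\, g_t s_t^{LM}] \\
P(y_t \mid h, y_{1:t-1}) &= \operatorname{softmax}(W_{DF} s_t^{DF} + b_{DF})
\end{aligned}
\tag{12.21}
$$
where $c_t$ is the context vector, $h$ is the output of the encoder, $s_t$ and
$s_t^{LM}$ are the hidden states of the AM decoder and the LM, and $v_g$, $b_g$,
$b_{DF}$, and $W_{DF}$ are all learned during the continued training phase.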
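
The gating step of deep fusion can be sketched in plain Python as follows. The
dimensions and random values are arbitrary assumptions for illustration, and the
helper names (`deep_fusion_step`, `rand_vec`) are hypothetical; in practice the
states would come from the pre-trained AM and LM.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def deep_fusion_step(c_t, s_am, s_lm, v_g, b_g, W_df, b_df):
    # Scalar gate computed from the LM hidden state.
    g_t = sigmoid(dot(v_g, s_lm) + b_g)
    # Fused state: [context; AM decoder state; gated LM state].
    s_df = c_t + s_am + [g_t * x for x in s_lm]
    logits = [dot(row, s_df) + b for row, b in zip(W_df, b_df)]
    return softmax(logits), g_t


random.seed(0)


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


ctx_dim, am_dim, lm_dim, vocab = 4, 5, 3, 6
probs, gate = deep_fusion_step(
    rand_vec(ctx_dim), rand_vec(am_dim), rand_vec(lm_dim),
    rand_vec(lm_dim), 0.0,
    [rand_vec(ctx_dim + am_dim + lm_dim) for _ in range(vocab)],
    [0.0] * vocab,
)
```

Only the gate vector, its bias, and the output projection are treated as trainable
here, mirroring the description that the AM and LM parameters stay fixed.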


12.5.4.2 Cold Fusion 


Cold fusion [Sri+17] extends the idea of deep fusion, incorporating the LM into 
the training procedure. However, in cold fusion the acoustic model is trained from 
scratch incorporating the pre-trained LM. 

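$$
\begin{aligned}
s_t^{LM} &= \mathrm{DNN}(d_t^{LM}) \\
s_t^{ED} &= W_{ED}[d_t; c_t] + b_{ED} \\
g_t &= \sigma(W_g[s_t^{ED}; s_t^{LM}] + b_g) \\
s_t^{CF} &= [s_t^{ED};\, g_t \circ s_t^{LM}] \\
r_t^{CF} &= \mathrm{DNN}(s_t^{CF}) \\
P(y_t \mid h, y_{1:t-1}) &= \operatorname{softmax}(W_{CF} r_t^{CF} + b_{CF})
\end{aligned}
\tag{12.22}
$$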


Because cold fusion incorporates the LM into the training process from the be- 
ginning, retraining is required if there are changes in the LM. The original paper 
introduces a means for switching language models by using LM logits instead of 
the LM hidden states; however, this does increase the number of learned parameters 
and computation. 


12.5.5 Combined CTC–Attention Decoding


Decoding with combined CTC–attention architectures relies on producing the most
probable character sequence $\hat{C}$. Combining the two outputs is non-trivial. Attention
produces a sequence of output labels, while CTC produces a label per frame. In
[Wat+17b], the authors propose two methods for combining the CTC and attention
outputs: rescoring and one-pass decoding.


12.5.5.1 Rescoring


Rescoring relies on a two-step method. The first step is to produce a set of complete 
hypotheses from the attention decoder. The second step is to rescore these hypothe- 
ses based on the CTC and attention probabilities (the forward algorithm is used to 
get the CTC probabilities). 


$$\hat{C} = \operatorname*{argmax}_{h \in \Omega} \{\lambda \alpha_{CTC}(h, X) + (1 - \lambda)\alpha_{ATT}(h, X)\} \tag{12.23}$$
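
The second step of rescoring can be sketched as below. The n-best list and its
log-probabilities are hypothetical; in a real system the CTC scores would come
from the forward algorithm and the attention scores from the decoder.

```python
def rescore(hypotheses, ctc_weight):
    """hypotheses: list of (text, ctc_log_prob, att_log_prob) tuples."""

    def joint(h):
        _, ctc, att = h
        # lambda * alpha_CTC + (1 - lambda) * alpha_ATT
        return ctc_weight * ctc + (1.0 - ctc_weight) * att

    return max(hypotheses, key=joint)


# Hypothetical complete hypotheses produced by the attention decoder.
n_best = [
    ("as he looked", -4.0, -1.0),
    ("ashe looked", -9.0, -0.8),
]
```

With a CTC weight of 0 the attention score alone picks the second hypothesis; a
weight of 0.3 is enough for the CTC score to promote the first.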


12.5.5.2 One-Pass Decoding


One-pass decoding, on the other hand, focuses on computing the probabilities of the 
partial hypotheses as characters are generated. 

A language model can also be incorporated into the decoding process [HCW18]
by adding a language modeling term to the decoder:


$$\hat{C} = \operatorname*{argmax}_{C \in U^*} \{\lambda \log P_{CTC}(C \mid X) + (1 - \lambda) \log P_{ATT}(C \mid X) + \gamma \log P_{LM}(C)\} \tag{12.24}$$


The score in the beam search can then be described as:

$$\alpha(h) = \lambda \alpha_{CTC}(h) + (1 - \lambda)\alpha_{ATT}(h) + \gamma \alpha_{LM}(h) \tag{12.25}$$

for each incomplete hypothesis h.

Computing the attention and language model scores is straightforward, with:

$$
\begin{aligned}
\alpha_{ATT}(h) &= \alpha_{ATT}(g) + \log P_{ATT}(c \mid g, X) \\
\alpha_{LM}(h) &= \alpha_{LM}(g) + \log P_{LM}(c \mid g)
\end{aligned}
\tag{12.26}
$$

where $h = g \cdot c$, g is a known hypothesis, and c is a character being appended to the
sequence to generate h.

CTC, however, is more nuanced because many frame-level sequences can produce
the same character sequence. Therefore the CTC score is the sum over all sequences
with h as the prefix:

$$P(h, \ldots \mid X) = \sum_{\nu \in (U \cup \{\langle EOS \rangle\})^{+}} P(h \cdot \nu \mid X) \tag{12.27}$$

The CTC score becomes:

$$\alpha_{CTC}(h) = \log P(h, \ldots \mid X) \tag{12.28}$$
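
To make the prefix sum concrete, the sketch below computes the CTC prefix score by
brute force: it enumerates every frame-level path, collapses it with the usual CTC
rule (merge repeats, then drop blanks), and sums the probability of all paths whose
collapsed output starts with h. This enumeration is only feasible for tiny examples
and stands in for the efficient dynamic-programming recursion used in practice. The
frame posteriors are hypothetical, and the score is undefined (log of zero) for a
prefix that no path can produce.

```python
import itertools
import math

BLANK = "-"


def collapse(path):
    """Merge repeated symbols, then remove blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_prefix_log_prob(frame_probs, prefix):
    """frame_probs: one {symbol: probability} dict per frame."""
    symbols = list(frame_probs[0].keys())
    total = 0.0
    for path in itertools.product(symbols, repeat=len(frame_probs)):
        if collapse(path).startswith(prefix):
            total += math.prod(frame_probs[t][s] for t, s in enumerate(path))
    return math.log(total)


# Hypothetical per-frame posteriors over {a, b, blank} for three frames.
frames = [
    {"a": 0.6, "b": 0.3, BLANK: 0.1},
    {"a": 0.2, "b": 0.5, BLANK: 0.3},
    {"a": 0.1, "b": 0.1, BLANK: 0.8},
]
score_a = ctc_prefix_log_prob(frames, "a")
score_b = ctc_prefix_log_prob(frames, "b")
```

Every collapsed output is either empty or starts with a or b, so the two prefix
probabilities plus the probability of the all-blank path sum to one.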
12.6 Speech Embeddings and Unsupervised Speech Recognition


The amount of unlabeled speech data that is available can be orders of magnitude
larger than the amount of paired speech–text data. Thus, unsupervised speech
recognition and acoustic embeddings for audio processing are promising areas of
research.


12.6.1 Speech Embeddings 


One of the earliest works for embeddings in speech was [BH14]. In this work, the
authors used a form of Siamese network to train acoustic word embeddings where
similar sounding words (acoustically similar) are clustered near each other in the
embedding space. In this fashion “words are nearby if they sound alike.” By mod-
eling words directly, the paradigm of speech recognition shifts away from trying to
model states in the traditional HMM.

This network is trained in two parts. First, a CNN classification model is
trained to classify spoken words in a fixed segment of audio (2 s). Second, this net-
work is fixed and incorporated in a word embedding network. The word embedding
network is trained to align the embedding of the correct word with the acoustic
embedding while separating wrong words. Words are represented as a bag of letter
n-grams, and only the top 50,000 letter n-grams are used, which reduces the size of
the input embedding space compared with using all words. The architecture diagram
is shown in Fig. 12.11.
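
A bag of letter n-grams can be sketched as follows. The boundary markers "[" and
"]", the maximum n-gram length of 3, and the use of Jaccard overlap as a quick
similarity check are assumptions made for this example, not details from [BH14].

```python
def letter_ngrams(word, max_n=3):
    """Return the set of letter n-grams (n = 1..max_n) of a padded word."""
    padded = "[" + word + "]"  # hypothetical word-boundary markers
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    return grams


def jaccard(a, b):
    """Overlap between two n-gram sets, between 0 and 1."""
    return len(a & b) / len(a | b)


similar = jaccard(letter_ngrams("please"), letter_ngrams("pleas"))
dissimilar = jaccard(letter_ngrams("please"), letter_ngrams("stones"))
```

Words with similar spellings share most of their subword units, which is what lets
a word vector built from these units be compared against an acoustic vector.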
Fig. 12.11: Acoustic embedding model trained with a triplet ranking loss to align
acoustic vectors and word vectors from subword units

The embedding space yielded similarities such as (please, pleas), (plug, slug),
and (heart, art).

A Siamese CNN network was also used in [KWL16] to discriminate between
same-word and different-word pairs given spoken instances of words. This net-
work achieved results similar to those of a strongly supervised word classification model.


12.6.2 Unspeech 


In Unspeech [MB18], the authors used a Siamese network to train embeddings used
with acoustic models for speaker adaptation, utterance clustering, and speaker com-
parison. This work relies on the assumption that neighboring areas of speech are likely
to have the same speaker. The contexts for true and false examples of speakers are
taken from neighboring context windows in the same utterance or from separate
files. This idea is similar to the concept of negative sampling. This network, there-
fore, does not expect similar words to lie close together in the embedding space, but
rather speech from the same speaker. The architecture is shown in Fig. 12.12.
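
The training objective described above (a dot product of two embeddings followed by
a logistic loss on a true/false context label) can be sketched as follows. The
two-dimensional embedding vectors are hypothetical values; in Unspeech they would be
produced by the CNN from log-Mel windows.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logistic_loss(target_emb, context_emb, is_true_context):
    """Binary logistic loss on the dot product of two embeddings."""
    score = dot(target_emb, context_emb)
    label = 1.0 if is_true_context else -1.0
    return math.log1p(math.exp(-label * score))


# Hypothetical embeddings of a target window and a nearby context window.
target = [1.0, 0.5]
context = [0.9, 0.6]
loss_if_true = logistic_loss(target, context, True)
loss_if_false = logistic_loss(target, context, False)
```

Because the two vectors point in a similar direction, labeling the pair as a true
context yields a small loss, while labeling it as a false context is penalized.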


12.6.3 Audio Word2Vec


One of the drawbacks to the CNN approach is that it requires fixed-length audio
segments. Audio Word2Vec [Chu+16] used a sequence-to-sequence autoencoder to
learn a fixed-length representation for variable-length spoken words. Because the
training target is the input itself, the representation can be learned in a completely
unsupervised way, hence the reference to Word2Vec. The resulting model is, therefore,
able to encode acoustic samples for use in a query-by-example system for words. Training
the model does not require supervision; however, creating the word embeddings
requires knowledge of word boundaries in the embedding process.

Audio Word2Vec was extended in [WLL18] to the utterance level by learning a seg-
mentation method as well. The method is an example of a segmental sequence-to-
sequence autoencoder (SSAE). The SSAE learns a segmentation gate to determine
the word boundaries in the utterance and a sequence-to-sequence autoencoder that
learns an encoding for each segment. Some guidance is needed to keep the autoen-
coder from splitting the utterance into too many embeddings; however, learning an
appropriate estimate is not differentiable. Reinforcement learning is used to estimate
this quantity, due to the non-differentiability of learning a discrete variable.


Fig. 12.12: Unspeech embeddings are trained using a Siamese CNN network
(VGG16A) to compute embedding vectors for a log-Mel target window and a log-Mel
context window. The dot product of the two vectors is computed, and a logistic loss
is used to optimize a binary classification task: whether the context window was
a true or false context window of the target


12.7 Case Study 


In this case study, we continue to focus on building ASR models on the Mozilla
Common Voice dataset.² Here, we focus specifically on a Deep Speech 2 model,
which trains an end-to-end network with CTC, and on a hybrid CTC–attention
model.


12.7.1 Software Tools and Libraries 


Since the release of the Deep Speech 2 paper, there have been multiple open-source
implementations of the architecture, with the most common difference being the
deep learning framework used. The most popular are the TensorFlow implemen-
tation by Mozilla,³ the PaddlePaddle implementation,⁴ and the PyTorch version.⁵
Each has benefits and drawbacks, which stem from the deep learning framework,
the amount of preprocessing required, and the use of variable-length vs. fixed-length
RNNs, among other factors. We focus on the PyTorch implementation for its simplicity.


2 https://voice.mozilla.org/en/data.

3 https://github.com/mozilla/DeepSpeech.

4 https://github.com/PaddlePaddle/DeepSpeech.

5 https://github.com/SeanNaren/deepspeech.pytorch.


One of the most recent advancements has been CTC+attention models, available
in particular through ESPnet.⁶ This toolkit focuses on end-to-end speech recognition and
text-to-speech. It uses Chainer and PyTorch as backends and provides
Kaldi-style recipes for some of the most modern architectures.


12.7.2 Deep Speech 2 


The Deep Speech 2 implementation used is written in PyTorch. It incorporates a 
parallelized data loader to speed model training, an optimized CTC loss function, 
a CTC-decoding library with language model support, and data augmentation for 
acoustic model training. 


12.7.2.1 Data Preparation 


The data preparation requires either a directory structure or manifest file. In the first 
approach, a dataset directory is structured as follows (Figs. 12.13, 12.14, and 12.15). 

There is no additional need for phonetic dictionaries for character-based mod- 
els; the data is processed into a spectrogram and then converted to a tensor at data 
loading time. 

In this implementation, one can also use a “manifest” file to define the datasets 
used. The manifest is similar to the Kaldi and Sphinx structures, containing a list of 
the examples in each dataset split. Manifest files can be useful for filtering longer 
files when using variable-length RNNs. 
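
A manifest in the two-column layout of Fig. 12.14 can be generated from the
directory layout with a few lines of Python. The helper below is a hypothetical
utility written for this example and is not part of the Deep Speech 2
implementation; it assumes that every audio file in wav/ has a transcript with the
same stem in txt/ and skips files that do not.

```python
import csv
from pathlib import Path


def write_manifest(split_dir, manifest_path):
    """Write one 'wav_path,txt_path' row per paired sample in split_dir."""
    split_dir = Path(split_dir)
    rows = []
    for wav in sorted((split_dir / "wav").glob("*.wav")):
        txt = split_dir / "txt" / (wav.stem + ".txt")
        if txt.exists():  # skip audio without a transcript
            rows.append((str(wav), str(txt)))
    with open(manifest_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return rows
```

For example, `write_manifest("common_voice/train", "data/train_manifest.csv")` would
list every paired sample of the training split; a length filter could be added at
the point where rows are appended.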


12.7.2.2 Acoustic Model Training 


First, we train a base model given the default configuration. The resulting model has 
two convolutional layers and five bidirectional GRU layers, yielding approximately 
41 million learnable parameters. We enable the augmentation step during training 
as well, which applies small changes to the tempo and gain to reduce overfitting. 


6 https://github.com/espnet/espnet.
/common_voice
    /train
        txt/
            train_sample000000.txt
            train_sample000001.txt
            ...
        wav/
            train_sample000000.wav
            train_sample000001.wav
            ...
    /val
        txt/
            ...
        wav/
            ...
    /test
        txt/
            ...
        wav/
            ...

Fig. 12.13: Directory structure for Deep Speech 2

/path/to/train_sample000000.wav,/path/to/train_sample000000.txt
/path/to/train_sample000001.wav,/path/to/train_sample000001.txt
...


Fig. 12.14: Manifest structure for the training set for Deep Speech 2


python train.py --train-manifest data/train_manifest.csv --val-manifest data/val_manifest.csv


Fig. 12.15: Training command for Deep Speech 2


We train all models on a GPU,⁷ with early stopping based on the WER of the
validation set. In our case, the model began diverging after about 15 epochs, as
shown in Fig. 12.16, and achieved its best validation WER of 23.470. Once the model
is trained, we evaluate the best model on the test set, where we achieve an average
WER of 22.611 and CER of 7.757 using greedy decoding. A few samples of the
greedy decoding of the trained model are shown in Fig. 12.17.
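
For reference, WER and CER can both be computed from the Levenshtein edit distance,
over words and over characters respectively. The sketch below is a straightforward
dynamic-programming version and is not the evaluation code of the toolkit; it
returns fractions rather than percentages. The hypothesis string is an illustrative
transcript in which the final word is truncated.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)


reference = "as he looked at the stones he felt relieved for some reason"
hypothesis = "as he looked at the stones he felt relieved for some as"
sample_wer = wer(reference, hypothesis)  # 1 wrong word out of 12
sample_cer = cer(reference, hypothesis)  # 4 deleted characters out of 59
```

A single truncated word costs one word error but four character errors, which is
why the two metrics can move in different directions for the same utterance.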


7 Although it is possible to train this model on a CPU, it is unrealistic due to the computationally 
intensive nature of the convolutional and recurrent layers. 
Fig. 12.16: Training curve of Deep Speech 2 with the default configuration


Ref: i understand sheep they're no longer a problem and they can be good friends
Hyp: i understand shee they're no longery problem and they can be good friends
WER: 0.214 CER: 0.027

Ref: as he looked at the stones he felt relieved for some reason
Hyp: ashe looked at the stones he fely relieved for som ason
WER: 0.333 CER: 0.051


Fig. 12.17: Output from the base Deep Speech 2 model. Note how many of the 
mistakes seem phonetic and create nonlogical words, such as shee and ashe 


12.7.3 Language Model Training 


The character-based predictions produce reasonable transcripts without a language
model. However, we can improve on the greedy predictions by providing a language
model during the decoding phase. We leverage the ctcdecode⁸ package, which is
integrated into the PyTorch Deep Speech 2 implementation, to apply different
decoding schemes. One thing to note about this language model is that it incorporates
a character FST as well. The FST acts as a spell checker, enforcing the production
of words.


8 https://github.com/parlance/ctcdecode.

Decoding schemes can be applied to improve the error rates of the predictions. 
These results are summarized in Table 12.1. 

The KenLM toolkit [Hea+13] is used to train an n-gram language model (Fig. 12.18).
The language model is created from transcripts of the training corpus to provide
results comparable to previous case studies. In practice, language models are usually
trained on very large corpora, such as the previously mentioned Common
Crawl.⁹


kenlm/build/bin/lmplz -o 2 < training_transcripts.txt > cv_2gram_lm.arpa

kenlm/build/bin/build_binary cv_2gram_lm.arpa cv_2gram_lm.trie


Fig. 12.18: Train a 2-gram language model with KenLM on the training transcripts. 
The first command creates an ARPA language model from the transcripts, and the 
second command creates a binary trie-structure from the language model used in 
the decoding phase 


We determine the best language model for the system by evaluating them on the 
validation set, and the best model is chosen to apply to the testing set. Table 12.1 
summarizes the WER and CER for different language models. 


Table 12.1: Validation results for different decoding methods. The best results are
in bold

Decoding method | WER    | CER
None            | 22.832 | 8.029
2-gram          | 12.919 | 7.292
3-gram          | 12.027 | 6.990
4-gram          | 11.865 | 6.915
5-gram          | 11.977 | 6.955


After applying the language model with the default beam size (beamwidth = 10), 
we see that our best model is the 4-gram model. Now, we can increase the size of 
the beam to evaluate the impact on the predictions. The results are summarized in 
Table 12.2. 


9 http://commoncrawl.org.
Table 12.2: Validation results for different beam sizes. The best results are in bold

Decoding method  | WER    | CER
4-gram, beam=10  | 11.865 | 6.915
4-gram, beam=64  | 7.742  | 4.458
4-gram, beam=128 | 6.939  | 3.984
4-gram, beam=256 | 6.288  | 3.616
4-gram, beam=512 |        |





Ref: i understand sheep they're no longer a problem and they can be good friends
Hyp: i understand sheep they're no longer a problem and they can be good friends
WER: 0.0 CER: 0.0

Ref: as he looked at the stones he felt relieved for some reason
Hyp: as he looked at the stones he felt relieved for some as
WER: 0.083 CER: 0.068


Fig. 12.19: Test output with language model decoding. Note many of the phonetic 
mistakes are corrected when incorporating the language model during the decoding; 
however, it can also cause different mistakes. In the second example, greedy decod- 
ing output ason instead of reason, but after the application of the language model, 
the hypothesis reduced this to as, reducing the WER and increasing the CER for this 
example 


The computation time increases linearly with the beam size. In practice, it is best
to choose a beam size that offers a good trade-off between decoding speed and accuracy.
After applying our best LM (4-gram) with a beam size of 512 to the test set, we
achieve a WER of 5.587 and CER of 3.232. Some examples of the decoded output
are shown in Fig. 12.19.


12.7.4 ESPnet 


ESPnet¹⁰ is an end-to-end speech processing toolkit that draws inspiration from
Kaldi. It incorporates hybrid CTC–attention architectures, mainly the ones described
in [KHW17] and [Wat+17b]. Much of the toolkit is bash-script focused,
similar to Kaldi, with Chainer and PyTorch backends. In this portion of the case
study, a hybrid CTC–attention architecture is trained on the Common Voice dataset
using the ESPnet toolkit.


10 https://github.com/espnet/espnet.


12.7.4.1 Data Preparation 


The data preparation is very similar to Kaldi, with a reliance on Kaldi for some of the 
preprocessing. The main difference is the lack of phonetic lexicons and dictionaries 
required in Kaldi. We generate MFCC features and store them in a JSON format. 
This format contains the target transcript, the tokenized transcript, location of the 
features, and some additional information for various components of the training. 
An example of the formatted training data is shown in Fig. 12.20. 

After extracting the features and creating the input file, the network is ready to 
train. 


12.7.4.2 Model Training 


The model training procedure also follows the Kaldi scripts to some degree; how- 
ever, once the features are extracted, we run the training scripts. 

The model trained is a 4-layer bidirectional LSTM encoder and a 1-layer uni- 
directional LSTM decoder. We train this model with Adadelta for 20 epochs on a 
single GPU. The full list of training arguments is shown in Fig. 12.21. 

During the training procedure the losses for both CTC and attention can be mon- 
itored to ensure that there is consistency in the convergence. The overall loss for 
training and validation is the weighted sum of the two components. We also notice 
that the validation loss trends with the training data loss until the final epoch. In this 
example, we set a hard stop for computational reasons on the number of epochs run. 
To obtain our best model, we would continue training until the validation consis- 
tently diverges from the training loss, and choose the model that performs best on 
the validation data (Fig. 12.22). 

The accuracy curves, shown in Fig. 12.23, display the network performance
as training progresses. The first two epochs show significant gains, with more
modest improvements thereafter. Our best model achieves a WER of 12.07 on the
validation data.

We can inspect the output attention weights during the decoding process by plot-
ting the weight of each time step during the decoding. Visualizing attention, as be-
fore, shows what portion of the input is attended to during inference. This is shown
in Fig. 12.24. We notice that the output generally correlates with the input audio file,
yielding an aligned output that is capable of segmenting the audio, as well as deal-
ing with the offsets in time. After the first epoch, we notice some breaks in the
attention alignments to the input, whereas after the final epoch, the attention
alignments appear seamlessly aligned to the input.

Fig. 12.20: data.json input file format for ESPnet training


python asr_train.py --backend pytorch --outdir exp/results \
    --dict data/lang_1char/train_nodev_units.txt --minibatches 0 --resume \
    --train-json dump/cv_valid_train/deltafalse/data.json \
    --valid-json dump/cv_valid_dev/deltafalse/data.json \
    --etype blstmp --elayers 4 --eunits 320 --eprojs 320 --subsample 1_2_2_1_1 \
    --dlayers 1 --dunits 300 --atype location --adim 320 \
    --aconv-chans 10 --aconv-filts 100 --mtlalpha 0.5 --batch-size 30 \
    --maxlen-in 800 --maxlen-out 150 --sampling-probability 0.0 \
    --opt adadelta --epochs 20


Fig. 12.21: Training command for ESPnet 


[Plot legend: main/loss, validation/main/loss, main/loss_ctc, validation/main/loss_ctc,
main/loss_att, validation/main/loss_att]

Fig. 12.22: Losses during training

Fig. 12.23: Training and validation accuracy curves for the model


Our base model achieves a WER of 12.34 and CER of 6.25 on the testing set with
greedy decoding (beam size of 1). When incorporating a beam search with a beam
size of 20 (the ESPnet default) into the predictions on the testing set, we reduce the
WER to 11.56 and CER to 5.80. We leave tuning the beam size and incorporating a
language model as an exercise; recall the significant improvement that language
model decoding gave the Deep Speech 2 architecture.

Fig. 12.24: Attention weights for a single file on the input audio after (a) the 1st
epoch and (b) the 20th epoch


12.7.5 Results 


We now provide a summary of the techniques evaluated in this case study. The 
testing results are displayed in Table 12.3. 
Table 12.3: End-to-end speech recognition performance on Common Voice test set
(best result highlighted)

Approach                                     | WER
Deep Speech 2 (no decoding)                  | 22.83
Deep Speech 2 (4-gram LM, beam size of 512)  | 5.59
ESPnet (no decoding)                         | 12.34
ESPnet (no LM, beam size of 20)              | 11.56
Kaldi TDNN (Chap. 8)                         | 4.44

Overall, with a CTC–attention model, we get faster, more stable convergence and
a lower WER for the base acoustic model compared to the Deep Speech 2 baseline
(WER of 22.83).

Although this result is not better than the one achieved with Kaldi in the Chap. 8 case
study, the results of the Deep Speech 2 (with a language model) and Kaldi
models are comparable, even without a lexicon model. The training procedure is
more straightforward than the training steps required for the traditional approaches
to ASR; for example, it removes the need for iterative training and alignment. Ad-
ditional benefits can also be gained from the inclusion of a language model during
decoding to provide compelling results without significant linguistic resources.


12.7.6 Exercises for Readers and Practitioners 


Some other interesting problems readers and practitioners can try on their own in- 
clude: 


1. What changes are required to train a Deep Speech 2 model on a new language? 

2. What would be the effects of training a language model on more data? 

3. Would the incorporation of the testing transcripts improve the results on the val- 
idation data? What about the testing data? 

4. Does the incorporation of the testing transcripts in the language model corrupt 
the validity of the results? 

5. How could an RNN language model be incorporated into the decoding process 
for Deep Speech 2? For ESPnet? 

6. Perform a grid search for the beam size on the ESPnet model. 
Chapter 13
Deep Reinforcement Learning
for Text and Speech


13.1 Introduction 


In this chapter, we investigate deep reinforcement learning for text and speech ap- 
plications. Reinforcement learning is a branch of machine learning that deals with 
how agents learn a set of actions that can maximize expected cumulative reward. In 
past research, reinforcement learning has focused on game play. Recent advances 
in deep learning have opened up reinforcement learning to wider applications for 
real-world problems, and the field of deep reinforcement learning was spawned. In 
the first part of this chapter, we introduce the fundamental concepts of reinforce- 
ment learning and their extension through the use of deep neural networks. In the 
latter part of the chapter, we investigate several popular deep reinforcement learning 
algorithms and their application to text and speech NLP tasks. 


13.2 RL Fundamentals 


Reinforcement learning (RL) is one of the most active fields of research in artificial
intelligence. While supervised learning requires us to provide labeled, independent
and identically distributed data, reinforcement learning requires us to only specify
a desired reward. Furthermore, it can learn sequential decision-making tasks that
involve delayed rewards, especially those that occur far in the future.

A reinforcement learning agent interacts with its environment in discrete time
steps. At each time $t$, the agent in state $s_t$ chooses an action $a_t$ from the set of
available actions, transitions to a new state $s_{t+1}$, and receives reward $r_{t+1}$. The
goal of the agent is to learn the best set of actions, termed a policy, in order to
generate the highest overall cumulative reward (Fig. 13.1). The agent can (possibly
randomly) choose any action available to it. Any one set of actions that an agent
takes from start to finish is termed an episode. As we will see below, we can use
Markov decision processes to capture the episodic dynamics of a reinforcement
learning problem.

Due to the sequential decision-making nature of reinforcement learning, it suffers
from a difficulty commonly known as the credit assignment problem. Since there are
many actions that can lead to a delayed reward, it is difficult for reinforcement learn-
ing methods to identify the subset of actions that had the greatest positive or negative
effect on these rewards. This becomes an especially difficult problem for large state
and action spaces.








Fig. 13.1: Agent–environment interaction in reinforcement learning


13.2.1 Markov Decision Processes 


A Markov decision process (MDP) is a useful mathematical framework that models 
situations as a discrete time stochastic control process. Mathematically, an MDP can 
be expressed using the tuple: 


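$$(\mathcal{S}, \mathcal{A}, P_a, R_a, \gamma) \tag{13.1}$$
where:
$\mathcal{S}$ = a finite set of states
$\mathcal{A}$ = a finite set of actions
$P_a$ = the state transition probability under each action $a$
$R_a$ = the reward received for taking an action $a$
$\gamma$ = the time discount factor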


The process is in some state $s$, and at each time step, the decision maker may choose
any action $a$ that is available in state $s$. The process responds at the next time step
by randomly moving into a new state $s'$ and giving the decision maker a corre-
sponding reward $R_a(s, s')$. The probability that the process moves into its new state
$s'$ from the current state $s$ is influenced by the chosen action and is obtained by
summing over the possible rewards $r$. Specifically, it is defined by the state
transition function $p(s' \mid s, a)$:

$$p(s' \mid s, a) = \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a) \tag{13.2}$$

such that:

$$\sum_{s' \in \mathcal{S}} \sum_{r \in \mathcal{R}} p(s', r \mid s, a) = 1, \quad \text{for all } s \in \mathcal{S},\ a \in \mathcal{A}(s) \tag{13.3}$$

Thus, the next state $s'$ depends on the current state $s$ and the decision maker's action
$a$. But given $s$ and $a$, it is conditionally independent of all previous states and actions;
in other words, the state transitions of an MDP satisfy the Markov property.

Markov decision processes are an extension of Markov chains where the dif-
ference is the addition of a set of actions (allowing choice) and rewards (giving
motivation). Conversely, if only one action exists for each state and all rewards are
equal, a Markov decision process reduces to a Markov chain.
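
The dynamics function $p(s', r \mid s, a)$ can be stored as a simple lookup table. The
toy two-state MDP below is a hypothetical example (its states, actions, rewards,
and probabilities are not from the text); the helpers marginalize over rewards as
in Eq. (13.2) and check the normalization condition of Eq. (13.3).

```python
from collections import defaultdict

# dynamics[(s, a)] maps (s_next, reward) -> probability (hypothetical toy MDP).
dynamics = {
    ("low", "wait"): {("low", 0.0): 1.0},
    ("low", "charge"): {("high", 0.0): 1.0},
    ("high", "wait"): {("high", 1.0): 1.0},
    ("high", "search"): {("high", 2.0): 0.7, ("low", 2.0): 0.2, ("low", -3.0): 0.1},
}


def transition_probs(s, a):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    probs = defaultdict(float)
    for (s_next, _reward), p in dynamics[(s, a)].items():
        probs[s_next] += p
    return dict(probs)


def is_normalized(tol=1e-9):
    """Check that the outcomes of every (s, a) pair sum to one."""
    return all(abs(sum(outcomes.values()) - 1.0) < tol
               for outcomes in dynamics.values())
```

Here, searching from the high state leads to the low state with total probability
0.3, regardless of which of the two possible rewards accompanies the transition.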


13.2.2 Value, Q, and Advantage Functions


We define $r_t$ as the reward we receive at time $t$. We can define the return as the sum
of the sequence of future rewards:

$$G_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots \tag{13.4}$$

Normally, we include a time discount factor $\gamma \in (0,1)$, and the future cumulative
reward can be expressed as:

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{13.5}$$

With this definition, we can define the concept of a value function of a state $s$ as the
expected cumulative return:

$$V(s) = \mathbb{E}[G_t \mid s_t = s] \tag{13.6}$$

The value function for any particular state is not unique. It depends on the set of
actions we take going forward in the future. We define a set of future actions known
as a policy $\pi$:

$$a = \pi(s) \tag{13.7}$$

Then the value function associated with this policy is unique:

\begin{align}
V_\pi(s) &= \mathbb{E}_\pi[G_t \mid s_t = s] \tag{13.8} \\
&= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right] \tag{13.9}
\end{align}

Note that while this policy-associated value function is unique, the actual value can
be stochastic under a non-deterministic policy (e.g., one where we sample from a
distribution of possible actions defined by the policy):

$$\pi(a \mid s) = P[a \mid s] \tag{13.10}$$
In addition to finding the value function of a particular state, we can also define a
value function for a particular action given a state. This is known as the action-value
function or Q function:

\begin{align}
Q_\pi(s, a) &= \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a] \tag{13.11} \\
&= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right] \tag{13.12}
\end{align}

Like the value function, the Q function is uniquely specified for a particular policy
$\pi$ of actions. The expectation takes into account the randomness in future actions
according to the policy, as well as the randomness of the returned state from the
environment. Note that:

$$V_\pi(s) = \mathbb{E}_{a \sim \pi}[Q_\pi(s, a)] \tag{13.13}$$

The advantage function for a policy $\pi$ measures the importance of an action by
finding the difference between the action-value and state-value functions:

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s) \tag{13.14}$$

Because the value function V measures the value of state $s$ following policy $\pi$ while
the Q function measures the value of taking action $a$ from state $s$, the advantage
function measures the benefit or loss of taking a particular action from state $s$.
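
The relationship between Eqs. (13.13) and (13.14) can be checked numerically for a
single state. The policy probabilities and Q values below are hypothetical.

```python
def state_value(policy, q_values):
    """V(s) = E_{a ~ pi}[Q(s, a)] for one state."""
    return sum(policy[a] * q_values[a] for a in policy)


def advantages(policy, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action in one state."""
    v = state_value(policy, q_values)
    return {a: q_values[a] - v for a in q_values}


# Hypothetical stochastic policy and action values for a single state.
policy = {"left": 0.25, "right": 0.75}
q_values = {"left": 1.0, "right": 3.0}
v = state_value(policy, q_values)
adv = advantages(policy, q_values)
expected_advantage = sum(policy[a] * adv[a] for a in policy)
```

An action with positive advantage is better than the policy's average behavior in
that state, and the advantage averaged over the policy is always zero.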


13.2.3 Bellman Equations 


The fundamental breakthrough of reinforcement learning is a set of propagation
equations for the value and Q functions. These equations are commonly known as
the Bellman equations, named after Richard Bellman, an American applied mathe-
matician. For the state value function, the Bellman equation is given by:

$$V_\pi(s) = \mathbb{E}_{s'}[r + \gamma V_\pi(s') \mid s_t = s] \tag{13.15}$$

What this equation tells us is that the state value function associated with policy $\pi$ is
the expectation of the reward received at the next state and its discounted state value
function. Similarly, the Bellman equation for the Q function is given by:

$$Q_\pi(s, a) = \mathbb{E}_{s', a'}[r + \gamma Q_\pi(s', a') \mid s_t = s, a_t = a] \tag{13.16}$$

The importance of the Bellman equations is that they let us express values of states
as values of other states. This means that if we know the value of $s_{t+1}$, we can
very easily calculate the value of $s_t$. This opens the door to iterative approaches for
calculating the value for each state, since if we know the value of the next state, we
can calculate the value of the current state. Sound familiar? This is similar to the
notion of backpropagation.


13.2.4 Optimality 


The goal of any reinforcement learning problem is to find the optimal decisions that
lead to the highest expected cumulative reward. Reinforcement learning methods fall
under one of several main categories depending on what they optimize the policy $\pi$ for:

1. the expected reward:

$$\max_\pi \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1}\right] \tag{13.17}$$

2. the advantage function:

$$\max_\pi A_\pi(s, a) \tag{13.18}$$

3. the Q function:

$$\max_\pi Q_\pi(s, a) \tag{13.19}$$

Methods such as dynamic programming or policy gradients seek to optimize ex- 
pected reward, while actor-critic models and Q-learning methods focus on optimiz- 
ing the advantage and Q-functions, respectively. 

For any specific policy of actions, we can use the value function to determine its 
expected reward. There is always at least one policy that is better than or equal to 
all other policies. This is known as the optimal policy, denoted $\pi_*$, which may not be
unique. All optimal policies share the same state-value function:


$$V_*(s) = \max_\pi V_\pi(s)$$ (13.20)

Optimal policies also share the same action-value function:

$$Q_*(s,a) = \max_\pi Q_\pi(s,a)$$ (13.21)


The Bellman equation can be applied to the optimal state-value function $V_*$ to give
us the Bellman optimality equation, which is independent of any chosen policy:


$$V_*(s) = \max_a \mathbb{E}_{s'}\left[ r + \gamma V_*(s') \mid s, a \right]$$ (13.22)


Similarly, the optimal action-value function is independent of chosen policy and is 
given by: 


$$Q_*(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q_*(s',a') \mid s, a \right]$$ (13.23)


13.2.5 Dynamic Programming Methods


When the environment is known and completely specified, dynamic programming 
methods can be applied to find optimal policies. The key notion is to use value func- 
tions to search for improved policies. Commonly applied to finite Markov decision 
process problems, dynamic programming underpins an important class of reinforce- 
ment learning algorithms. 


13.2.5.1 Policy Evaluation 


Given a policy $\pi$, we can determine the state-value function $V_\pi$ for this policy. Using
the Bellman equation above, it is possible to start with an approximation for $V_\pi$ and
iteratively update an estimate $V_k$ until it converges to $V_\pi$ as $k \to \infty$:


$$V_{k+1}(s) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s \right]$$ (13.24)
$$= \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)\left[ r + \gamma V_k(s') \right]$$ (13.25)


The above is an expected update since it is based on the expectation over all possible 
next states and actions (Fig. 13.2). 
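
The expected update of Eq. (13.25) can be sketched in a few lines of Python. The three-state chain MDP, its transition table `P`, and the function name `policy_evaluation` are hypothetical and chosen only for illustration; the sweep evaluates a uniform random policy.

```python
# Hypothetical chain MDP: states 0 and 1 are non-terminal, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
}
TERMINAL = 2
GAMMA = 0.9


def policy_evaluation(policy, P, terminal, gamma, tol=1e-10):
    """Repeat the expected update of Eq. (13.25) until values stop changing."""
    V = {s: 0.0 for s in P}
    V[terminal] = 0.0  # the terminal state has value zero by definition
    while True:
        new_V = dict(V)
        delta = 0.0
        for s in P:
            # Sum over actions (weighted by the policy) and over outcomes.
            new_V[s] = sum(
                policy[s][a] * prob * (r + gamma * V[s2])
                for a in P[s]
                for prob, s2, r in P[s][a]
            )
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < tol:
            return V


uniform = {s: {"stay": 0.5, "go": 0.5} for s in P}
V = policy_evaluation(uniform, P, TERMINAL, GAMMA)
print(V)
```

Each sweep needs the full transition table, which is why this style of update is limited to environments that are known and completely specified.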
13.2.5.2 Policy Improvement 


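Consider an action $a$ for state $s$ that is not the one chosen by policy $\pi$. The value of taking
this action and following $\pi$ thereafter is given by the action-value function:


$$Q_\pi(s,a) = \mathbb{E}\left[ r_{t+1} + \gamma V_\pi(s_{t+1}) \mid s_t = s, a_t = a \right]$$ (13.26)
$$= \sum_{s',r} p(s',r|s,a)\left[ r + \gamma V_\pi(s') \right]$$ (13.27)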


If we compare the value of taking this action to the value of following our policy $\pi$, we can decide if we
should adopt a new policy that takes action $a$. This leads to the policy improvement
theorem, which states that for any two deterministic policies $\pi$ and $\pi'$, if for all states $s$:


$$Q_\pi(s,\pi'(s)) \geq V_\pi(s)$$ (13.28)


it must be that:

$$V_{\pi'}(s) \geq V_\pi(s)$$ (13.29)


When we find that a new policy $\pi'$ is better, we can compute its value function $V_{\pi'}$ and use it to find
an even better policy. This iterative process is called policy iteration, where we cycle between policy
evaluation ($\pi \to V_\pi$) and policy improvement ($V_\pi \to \pi'$) until we find the optimal
policy $\pi_*$:

$$\pi_0 \xrightarrow{E} V_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} V_*$$ (13.30)


where $E$ denotes policy evaluation and $I$ denotes policy improvement. Because a
finite MDP has a finite number of possible deterministic policies, this process will converge to $\pi_*$.


13.2.5.3 Value Iteration 


There is a potentially serious drawback with policy iteration in that the evaluation of
a policy $\pi$ is computationally expensive, as it requires iterative calculation over every
state in the MDP. Instead of waiting for convergence as $k \to \infty$, we can approximate
$V_\pi$ by performing a single update iteration ($V_k \to V_{k+1}$):


$$V_{k+1}(s) = \max_a \mathbb{E}\left[ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s, a_t = a \right]$$ (13.31)
$$= \max_a \sum_{s',r} p(s',r|s,a)\left[ r + \gamma V_k(s') \right]$$ (13.32)


This is known as value iteration, which is computationally efficient as it combines 
truncated policy evaluation with policy improvement. 
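
As a minimal sketch of Eq. (13.32), the following Python code runs value iteration on a small deterministic chain. The MDP, the table `P`, and the helper names are assumptions made for this example; the in-place (sweep-as-you-go) update is one common variant, not necessarily the one the equations imply.

```python
# Hypothetical chain MDP: states 0 and 1 are non-terminal, state 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 0.0)], "go": [(1.0, 2, 1.0)]},
}


def backup(V, P, s, a, gamma):
    """Expected one-step return of taking action a in state s."""
    return sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])


def value_iteration(P, terminal, gamma, tol=1e-10):
    V = {s: 0.0 for s in P}
    V[terminal] = 0.0
    while True:
        delta = 0.0
        for s in P:
            # Eq. (13.32): maximize the expected backup over actions.
            best = max(backup(V, P, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place update
        if delta < tol:
            break
    # Policy improvement: act greedily with respect to the final values.
    greedy = {s: max(P[s], key=lambda a: backup(V, P, s, a, gamma)) for s in P}
    return V, greedy


V, greedy = value_iteration(P, terminal=2, gamma=0.9)
print(V, greedy)
```

On this chain the greedy policy moves toward the terminal state from both non-terminal states.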


Fig. 13.2: Dynamic programming backup diagram (expected update $V(s_t) \leftarrow \mathbb{E}_\pi\left[ r_{t+1} + \gamma V(s_{t+1}) \right]$)


13.2.5.4 Bootstrapping 


Bootstrapping, an important concept in dynamic programming, refers
to the estimation of state or state-action values from estimates of the values of
successor states. Bootstrapping is a component of other RL methods such as temporal
difference learning and Q-learning, and it enables faster, online learning. However, since
it relies on using estimates to make estimates, instability can occur,
and methods that bootstrap over longer sequences of successor states tend to have
better convergence properties.


13.2.5.5 Asynchronous DP 


Dynamic programming methods operate over the entire set of states of a finite MDP. 
Where the set of states is large, DP is intractable, as every state must necessarily 
be updated before one sweep is completed. Asynchronous dynamic programming 
methods do not wait for all states to be updated, but instead update a subset of states 
during each sweep. Such methods will converge as long as all states are eventually 
updated. Asynchronous DP methods are very useful in that they can run in an online 
manner, concurrently as an agent is experiencing the states of the MDP. As such, the 
agent’s experience can be considered in choosing the subset of states to update. This
is similar to the concept of beam search.


13.2.6 Monte Carlo 


Unlike dynamic programming methods that require complete knowledge of the en- 
vironment, Monte Carlo (MC) methods learn from a set of agent experiences. These 
episodic experiences are actual or simulated sequences of actions, states, and re- 
wards from the interaction of the agent with the environment. MC methods require 
no prior knowledge but can still yield optimal policies by simply using averaged 
sample reward for each state and action. 

Consider a set of episodes $E$, where each occurrence of state $s \in E$ is called a
visit. To estimate $V_\pi(s)$, we can follow each of the visits all the way to the end of the
episode to calculate the return $G$, and then average these returns through the incremental update:


$$V(s_t) \leftarrow V(s_t) + \alpha\left[ G_t - V(s_t) \right]$$ (13.33)


where $\alpha$ is the learning rate (Fig. 13.3). It is noteworthy that, with Monte Carlo
methods, the estimates for each state are independent of each other, since these methods do not use
bootstrapping. As such, Monte Carlo methods permit us to focus on a subset of
relevant states to improve results.
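
The update of Eq. (13.33) can be illustrated with a short every-visit Monte Carlo sketch. The episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state) and the function name `mc_update` are assumptions made for this example.

```python
def mc_update(V, episode, alpha, gamma):
    """Apply Eq. (13.33) to every visit in one finished episode.

    episode: list of (state, reward) pairs in time order, where reward is
    the reward received after leaving that state (a hypothetical format).
    """
    G = 0.0
    # Walk backwards so that G accumulates the discounted return from each visit.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        old = V.get(state, 0.0)
        V[state] = old + alpha * (G - old)
    return V


V = mc_update({}, [("A", 0.0), ("B", 1.0)], alpha=0.1, gamma=0.5)
print(V)
```

Note that the update for state A uses only the sampled return and never reads the estimate for state B, which is what the absence of bootstrapping means in practice.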

MC methods can be used to estimate state-action values as well as state values.
Instead of following visits to a state $s$, we can follow from an action $a$ taken at a visit
to state $s$, and average accordingly. Unfortunately, certain
state-action pairs may never be visited. For deterministic policies, only one action is
taken from any state and therefore only one state-action pair per state will be estimated, whereas the
value of all actions from each state must be estimated for policy improvement.

One method of overcoming this exploration issue in Monte Carlo is to
use exploring starts, a method that generates episodes by starting at randomly chosen
states and actions. This is called an on-policy method, since we seek to improve
the policy that is used to generate the episodes.


Fig. 13.3: Monte Carlo backup diagram


13.2.6.1 Importance Sampling 


Off-policy methods are based on two separate policies: a target policy that will be
optimized and another exploratory policy that is used to generate behavior (termed
the behavior policy). Off-policy Monte Carlo methods typically use the notion of
importance sampling, which is a technique for estimating expectations under one
distribution given samples from another. The key idea is to sample more frequently the
values that have greater impact on the expectation by shifting the probability mass, and
then to reweight each sample by the ratio of its probabilities under the two distributions.
Note that the target and behavior policies can otherwise be unrelated, with either or
both deterministic or stochastic, provided that the behavior policy assigns nonzero
probability to every action the target policy might take.
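
A minimal sketch of ordinary importance sampling for a one-step problem is shown below. The two-action setup, the policy tables `behavior` and `target`, and the function name `is_estimate` are hypothetical; the point is only that returns sampled under the behavior policy are reweighted by the probability ratio.

```python
import random


def is_estimate(samples, target, behavior):
    """Ordinary importance-sampling estimate of the expected return under
    `target`, from (action, return) pairs drawn under `behavior`."""
    total = 0.0
    for action, G in samples:
        rho = target[action] / behavior[action]  # importance ratio
        total += rho * G
    return total / len(samples)


random.seed(0)
behavior = {0: 0.5, 1: 0.5}   # exploratory policy that generates the data
target = {0: 0.1, 1: 0.9}     # policy whose value we want to estimate
reward = {0: 0.0, 1: 1.0}     # deterministic one-step returns (assumed)

samples = []
for _ in range(20000):
    a = 0 if random.random() < behavior[0] else 1
    samples.append((a, reward[a]))

estimate = is_estimate(samples, target, behavior)
print(estimate)  # should be close to the true value 0.9
```

The estimate is only well defined because `behavior` gives nonzero probability to every action that `target` can take.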


13.2.7 Temporal Difference Learning 


Temporal difference (TD) learning seeks to combine the best of both worlds of
dynamic programming and Monte Carlo methods. Like dynamic
programming, it uses bootstrapping to update estimates without waiting until the
end of the episode. Like Monte Carlo methods, it can learn from experience without an explicit
model of the environment. The simplest TD learning
method is one-step TD, also known as TD(0). It is based on updating the state value
function by (Fig. 13.4):


$$V(s_t) \leftarrow V(s_t) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$ (13.34)

This can be written as:

$$V(s_t) \leftarrow V(s_t) + \alpha \delta_t$$ (13.35)


where:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$ (13.36)


is known as the TD error. Whereas other methods like Monte Carlo must wait until
the end of an episode (time $T$) to update $V(s_t)$, this method uses only estimates from
the next time step to form an update. That is, one-step TD estimates the return $G_t$ as
$r_{t+1} + \gamma V(s_{t+1})$. This is an example of bootstrapping. Like Monte Carlo, TD uses a
sample return to approximate the expected return. Like dynamic programming, TD
uses $V(s_{t+1})$ in place of $V_\pi(s_{t+1})$. In contrast to DP methods, TD methods do not
require a model of the environment. Furthermore, TD methods update much more
rapidly in an online fashion, whereas Monte Carlo methods must wait until the end
of a full episode to calculate the returns used in the update. For very long episodes,
Monte Carlo may be too slow.


Fig. 13.4: Temporal difference backup diagram (update $V(s_t) \leftarrow V(s_t) + \alpha\left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)$)


One-step TD shares some similarity with stochastic gradient descent in that it
uses a one-step sample update rather than an expectation over the entire distribution
of successor states. Furthermore, both can be proven to converge under suitable
step-size conditions: for a fixed policy $\pi$, one-step TD can
be shown to asymptotically approach $V_\pi$. For faster convergence, TD can use batch
updating, where the value function is updated after computing and aggregating over
a batch of experiences.

TD methods are not limited to single time steps, and n-step TD allows bootstrapping
over multiple steps by using the update rule:


$$V_{t+n}(s_t) = V_{t+n-1}(s_t) + \alpha\left[ G_{t:t+n} - V_{t+n-1}(s_t) \right]$$ (13.37)

where $0 \le t < T$ and the n-step return is given by:


$$G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \underbrace{\gamma^n V_{t+n-1}(s_{t+n})}_{\text{future return estimate}}$$ (13.38)


This n-step return is an approximation of the full return, where the last term is an
estimate of the remaining return beyond n steps. While one-step TD can update
as soon as the successor state is observed, n-step TD must wait until n steps of the
episode have elapsed before updating. In exchange, n-step TD often provides better estimates of state
value functions, with better convergence properties than one-step TD.
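
The n-step return of Eq. (13.38) can be computed with a short backward recursion. The function name `n_step_return` and its argument layout are assumptions made for this sketch.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Eq. (13.38): rewards = [r_{t+1}, ..., r_{t+n}] and
    bootstrap_value = V(s_{t+n}), the estimate of the remaining return."""
    G = bootstrap_value
    # Folding from the last reward backwards applies one factor of gamma per step.
    for r in reversed(rewards):
        G = r + gamma * G
    return G


three_step = n_step_return([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
one_step = n_step_return([1.0], bootstrap_value=10.0, gamma=0.5)
print(three_step, one_step)
```

With a single reward the function reduces to the one-step TD target $r_{t+1} + \gamma V(s_{t+1})$.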


Algorithm 1: One-step TD learning algorithm
input : the policy π
output: the value function V
initialize V randomly with V(terminal) = 0
for each episode do
    initialize state s
    for each step of episode until terminal do
        take action a given by π(a|s)
        observe reward r, next state s'
        update V(s) ← V(s) + α[r + γV(s') - V(s)]
        update s ← s'
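
Algorithm 1 can be sketched in Python as follows. To keep the example deterministic, the policy and environment are replaced by pre-recorded transitions `(s, r, s_next)`, with `None` marking the terminal state; this format and the function name `td0` are assumptions, and values are initialized to zero rather than randomly.

```python
def td0(episodes, alpha=0.5, gamma=1.0):
    """One-step TD over recorded episodes of (s, r, s_next) transitions."""
    V = {}
    for episode in episodes:
        for s, r, s_next in episode:
            # V(terminal) = 0; unseen states default to 0 as well.
            v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
            td_error = r + gamma * v_next - V.get(s, 0.0)
            V[s] = V.get(s, 0.0) + alpha * td_error
    return V


# Hypothetical two-step episode: A -> B -> terminal, with reward 1 at the end.
episode = [("A", 0.0, "B"), ("B", 1.0, None)]
V = td0([episode] * 50)
print(V)
```

The value of A improves only after the value of B has started to move, which shows how information propagates backwards through bootstrapped estimates.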


13.2.7.1 SARSA 


Action-value methods are advantageous in model-free formulations, as they can
operate on current states without access to the model of the environment. This is in
contrast to state value functions, which require a model since they require knowledge
of future states and possible actions to be evaluated. We can apply the temporal
difference method to estimate the action-value function by considering the transitions
from one state-action pair to the next state-action pair:


$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right]$$ (13.39)


Note that this update is applied after every transition from a non-terminal state $s_t$;
if $s_{t+1}$ is terminal, $Q(s_{t+1},a_{t+1})$ is defined to be zero. Because this update depends on the tuple
$(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, it is called SARSA. It is a fully online, on-policy method that
asymptotically converges to the optimal policy and action-value function, provided that all
state-action pairs continue to be visited and the policy gradually becomes greedy.

Algorithm 2: SARSA learning algorithm
input : the policy π
output: the Q function
initialize Q(s,a) randomly with Q(terminal, all) = 0
for each episode do
    initialize state s
    choose action a from π(a|s) derived from Q
    for each step of episode until terminal do
        take action a, observe reward r, next state s'
        choose action a' from π(a'|s') derived from Q
        update Q(s,a) ← Q(s,a) + α[r + γQ(s',a') - Q(s,a)]
        update s ← s', a ← a'
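
The SARSA loop can be sketched on a toy problem as follows. The corridor environment (states 0 to 3, reward of -1 per step, terminal at state 3), the ε-greedy policy, and all function names are assumptions made for illustration; Q is initialized to zero rather than randomly.

```python
import random

ACTIONS = ("left", "right")
GOAL = 3  # terminal state of the hypothetical corridor


def step(s, a):
    """Move one cell; every step costs -1 until the goal is reached."""
    s2 = s + 1 if a == "right" else max(0, s - 1)
    return s2, -1.0, s2 == GOAL


def eps_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])


def sarsa(episodes=2000, alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s, eps, rng)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q, s2, eps, rng)  # next action from the same policy
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q


Q = sarsa()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(greedy)
```

Because the next action `a2` is drawn from the same ε-greedy policy that is being improved, the learned values reflect the cost of occasional exploratory moves.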


13.2.8 Policy Gradient 


Policy gradient methods seek to optimize the policy directly without having to learn
the state or action value function. In particular, these model-free methods use a
parametric representation for a stochastic policy $\pi(a|s;\theta)$ with parameters $\theta$ and
seek to optimize expected return:


$$\pi(a|s;\theta) \leftarrow \max_\theta \mathbb{E}_\pi\left[ G_t \right]$$ (13.40)


by applying gradient ascent to update the policy parameters:

$$\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E}_\pi\left[ G_t \right]$$ (13.41)


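Note that this formula evaluates the expectation prior to calculating the gradient,
which requires us to know the transition probability distribution induced by $\pi(a|s;\theta)$. For
analytical tractability, we can make use of the Policy Gradient Theorem, which relies on
the identity $\nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x)$ and is given by:


$$\nabla_\theta \mathbb{E}_\pi\left[ G_t \right] = \nabla_\theta \int p_\theta(x)\, G_t(x)\, dx$$ (13.42)
$$= \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, G_t(x)\, dx$$ (13.43)
$$= \mathbb{E}_\pi\left[ \nabla_\theta \log \pi(a_t|s_t;\theta)\, G_t \right]$$ (13.44)


which allows us to express the policy gradient update rule as:


$$\theta \leftarrow \theta + \alpha \mathbb{E}_\pi\left[ \nabla_\theta \log \pi(a_t|s_t;\theta)\, G_t \right]$$ (13.45)
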
Thus, we can update our policy without calculating the transition probability distri- 
bution of actions and states or requiring a model. 

Policy gradient methods are useful for both continuous and discrete action
spaces. A popular method, known as REINFORCE, applies stochastic gradient descent
such that only a single sequence is used for training at each step to estimate
parameters $\theta$. As such, it is an unbiased estimator with reduced computational burden.
But because it uses a single sequence to estimate rewards, REINFORCE can
suffer from high variance and take longer to converge. A way to reduce this variance
is to subtract a baseline reward $r_b(s_t)$ from our expected return, which teaches the
model to increase the probability of actions that generate above-average expected
returns:

$$\theta \leftarrow \theta + \alpha \mathbb{E}_\pi\left[ \nabla_\theta \log \pi(a_t|s_t;\theta) \left( G_t - r_b(s_t) \right) \right]$$ (13.46)


By sampling a batch of action sequences, the average reward over this batch can be
used as the baseline reward during gradient updates for each action sequence in this
batch. As long as the baseline reward does not depend on the action being evaluated,
the estimator remains unbiased.


Algorithm 3: The REINFORCE algorithm
input : policy π(a|s; θ)
output: optimal policy π*
initialize policy parameters θ
while not converged do
    generate an episode by following policy π
    for each step t in episode until terminal do
        calculate return G
        update θ ← θ + α γ^t G ∇ log π(a_t|s_t; θ)
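
A minimal REINFORCE sketch on a one-step problem is given below. The softmax parameterization, the two-action reward table, the fixed `baseline` argument, and the function names are all assumptions made for illustration. Because each episode has a single step, the return G equals the reward and the discount factor plays no role.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(rewards, steps=2000, alpha=0.1, baseline=0.0, seed=0):
    """REINFORCE with a softmax policy over len(rewards) actions."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # one preference per action
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(rewards)), weights=probs)[0]
        G = rewards[a]  # single-step episode: return equals reward
        for k in range(len(theta)):
            # Gradient of log softmax: indicator(k == a) - pi(k).
            grad_log = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += alpha * (G - baseline) * grad_log
    return softmax(theta)


probs = reinforce_bandit([0.0, 1.0], baseline=0.5)
print(probs)
```

Setting `baseline` to a value between the two rewards makes below-average actions receive a negative weight, in the spirit of Eq. (13.46).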


13.2.9 Q-Learning 


Q-learning is based on the notion that if the optimal Q-function is available, the 
optimal policy can be directly found by the relation: 


$$\pi^*(s) = \arg\max_a Q^*(s,a)$$ (13.47)

Therefore, these methods try to learn the optimal Q-function directly by always
choosing the best action from any state, without needing to consider the policy being
followed. Q-learning is an off-policy TD method that updates the action-value
function by:


$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\Big[ r_{t+1} + \underbrace{\gamma \max_{a'} Q(s_{t+1},a')}_{\text{expected future reward}} - Q(s_t,a_t) \Big]$$ (13.48)


This equation is very similar to SARSA, except that it estimates future expected reward
by maximizing over future actions. In effect, Q-learning uses a greedy update to
iterate toward the optimal Q-function, and has been shown to converge in the limit
to $Q^*$.


Algorithm 4: Q-learning algorithm
output: the Q function
initialize Q(s,a) randomly with Q(terminal, all) = 0
for each episode do
    initialize state s
    for each step of episode until terminal do
        choose action a from Q (ε-greedy)
        take action a, observe reward r, next state s'
        update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
        update s ← s'
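
Algorithm 4 can be sketched on a toy corridor as follows. The environment (states 0 to 3, reward of -1 per step, terminal at state 3), the hyperparameters, and the function names are hypothetical; Q is initialized to zero rather than randomly.

```python
import random

ACTIONS = ("left", "right")
GOAL = 3  # terminal state of the hypothetical corridor


def step(s, a):
    """Move one cell; every step costs -1 until the goal is reached."""
    s2 = s + 1 if a == "right" else max(0, s - 1)
    return s2, -1.0, s2 == GOAL


def q_learning(episodes=2000, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behavior policy: epsilon-greedy with respect to the current Q.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[(s, b)])
            s2, r, done = step(s, a)
            # Target uses the greedy value of the next state (off-policy).
            best_next = max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q


Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(greedy, Q[(0, "right")])
```

Unlike SARSA, the target ignores which action the ε-greedy behavior policy actually takes next, so the learned values approach those of the greedy policy (here, three steps at -1 each from state 0).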


13.2.10 Actor-Critic 


Actor-critic methods, like policy gradient methods, are based on estimating a
parametric policy. What makes actor-critic methods different is that they also learn a
parametric function which is used to evaluate action sequences and assist in learning.
The actor is the policy being optimized, while the critic is a value function that
can be thought of as a parametric estimate of the baseline reward in the policy
gradient update equation above:


$$\theta \leftarrow \theta + \alpha \mathbb{E}_\pi\Big[ \underbrace{\nabla_\theta \log \pi(a_t|s_t;\theta)}_{\text{actor}} \underbrace{\left( Q_\pi(s_t,a_t) - V_\pi(s_t) \right)}_{\text{critic}} \Big]$$ (13.49)


Note that we can replace the critic term by the advantage function:

$$\theta \leftarrow \theta + \alpha \mathbb{E}_\pi\left[ \nabla_\theta \log \pi(a_t|s_t;\theta)\, A_\pi(s_t,a_t) \right]$$ (13.50)


where $A_\pi(s_t,a_t) = Q_\pi(s_t,a_t) - V_\pi(s_t)$. Similar to the REINFORCE algorithm,
actor-critic methods can use stochastic gradient descent with a single sampled sequence. In
this instance, the advantage function takes the form:


$$A_\pi(s_t,a_t) = \underbrace{r_{t+1} + \gamma V_\pi(s_{t+1})}_{\text{estimate for } Q(s,a)} - V_\pi(s_t)$$ (13.51)


During learning, the actor provides sample states $s_t$ and $s_{t+1}$ for the critic to
estimate the value function. The actor then uses this estimate to calculate the advantage
function used to update the policy parameters $\theta$.

Since actor-critic methods rely on current samples to train the critic (as an
on-policy model), they suffer from the fact that estimates by the actor and critic are
correlated. This can be alleviated by moving to off-policy training, where samples are
accumulated and stored in a memory buffer. This buffer is then randomly sampled in batches
to train the critic. This is called experience replay, a sample-efficient
technique since individual samples can be used multiple times during training. In
general, batch training with actor-critic models can yield low-variance estimates, but
they will be biased if the critic is a poor estimator. This is in contrast to policy gradient
models, which may have high variance but are unbiased.


Algorithm 5: Actor-Critic algorithm


input : policy π(a|s; θ), state-value function v(s; w)
output: optimal policy π*


initialize policy parameters θ and state-value weights w
while not converged do
    initialize state s
    for each step t in episode until terminal do
        take action a from π(a|s; θ), observe reward r, next state s'
        A(s,a) ← r + γ v(s'; w) - v(s; w)
        update w ← w + β A(s,a) ∇v(s; w)
        update θ ← θ + α γ^t A(s,a) ∇ log π(a|s; θ)
        update s ← s'
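
A tabular sketch of the one-step actor-critic loop is given below. The two-step chain environment (moving right twice yields a reward of 1, moving left ends the episode with no reward), the logistic policy with one parameter per state, and all names are assumptions made for illustration.

```python
import math
import random


def pi_right(theta, s):
    """Probability of 'right' under a two-action logistic policy."""
    return 1.0 / (1.0 + math.exp(-theta[s]))


def actor_critic(episodes=5000, alpha=0.1, beta=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]   # actor: one preference per non-terminal state
    v = [0.0, 0.0, 0.0]  # critic: v[2] is the terminal state and stays 0
    for _ in range(episodes):
        s, t, done = 0, 0, False
        while not done:
            p = pi_right(theta, s)
            right = rng.random() < p
            if right:
                s2 = s + 1
                r = 1.0 if s2 == 2 else 0.0
            else:
                s2, r = 2, 0.0  # 'left' ends the episode with no reward
            done = s2 == 2
            adv = r + gamma * v[s2] - v[s]          # one-sample advantage
            v[s] += beta * adv                       # critic update (tabular)
            grad_log = (1.0 - p) if right else -p    # d/dtheta of log pi(a|s)
            theta[s] += alpha * (gamma ** t) * adv * grad_log
            s, t = s2, t + 1
    return theta, v


theta, v = actor_critic()
print(pi_right(theta, 0), pi_right(theta, 1), v)
```

The critic's estimate at the next state enters the advantage, so the actor at state 0 starts to improve only once the critic at state 1 has learned that moving right pays off.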




13.2.10.1 Advantage Actor Critic A2C 


A way to reduce variance with online training is to use multiple threads that act 
in parallel together as a batch to train the model. Each thread uses a single sample 
and calculates an update using the advantage function. When all threads have fin- 
ished calculating their update, they are batched together to update the model. This is 
known as the synchronous advantage actor-critic model, or A2C. As an algorithm, 
A2C is highly efficient and does not require a memory buffer. Furthermore, it can
leverage modern multi-core processors very effectively to accelerate computation. 


Algorithm 6: A2C algorithm


input : policy π(a|s; θ), state-value function v(s; w)
output: optimal policy π*


initialize policy parameters θ and state-value weights w
while not converged do
    initialize state s
    for each step t in episode until terminal do
        sample N actions a_i from π(a|s; θ), observe rewards r_i, next states s'_i
        A(s,a) ← (1/N) Σ_i [r_i + γ v(s'_i; w_i) - v(s_i; w_i)]
        update w_i ← w_i + β A(s_i,a_i) ∇v(s_i; w_i)
        update θ ← θ + α γ^t A(s,a) ∇ log π(a_i|s_i; θ)
        update s ← s'


13.2.10.2 Asynchronous Advantage Actor Critic A3C


Rather than waiting for all threads to finish calculating an update, we can update the 
model asynchronously. As soon as a thread calculates an update, it can broadcast 
the update to other threads which immediately apply it in their calculation. This is 
known as asynchronous advantage actor-critic or A3C and has received tremendous 
attention and unprecedented success due to its light computational footprint and 
quick training times. 


13.3 Deep Reinforcement Learning Algorithms 


Deep learning methods have several important applications in reinforcement learn- 
ing. Their ability to automatically learn large, distributed representations and serve 
as universal function approximators makes them useful for modeling parametric 
policies, value functions, and advantage functions. In particular, recent advances 
in deep learning methods for sequence-to-sequence models have led to interesting 
deep reinforcement learning applications for NLP. 

Deep neural networks are notoriously unstable when used to approximate non- 
linear functions like the state value function. There are a variety of techniques to 
stabilize learning, including batch training, experience replay, and target networks. 


13.3.1 Why RL for Seq2seq 


Sequence-to-sequence (seq2seq) models, as discussed in an earlier chapter, have 
been widely used to solve sequential problems. The most common method for train- 
ing seq2seq models is called teacher forcing, where ground-truth sequences are 
used to minimize the maximum-likelihood (ML) loss at each decoding step. How- 
ever, at test time, discrete metrics like Word Error Rate (WER) are often used to 
evaluate a model. These discrete metrics are non-differentiable and cannot be used 
in an ML framework for training. It is easy to optimize for ML loss at train time 
only to yield suboptimal metrics at test time, a problem known as train-test incon- 
sistency. 

Seq2seq models suffer from another significant problem known as exposure 
bias. While teacher forcing uses a ground truth label at each step to decode the 
next element in the sequence, this ground truth label is not available at test time. 
As a result, a seq2seq model can only use its own predictions to decode a sequence. This
means that errors will accumulate during output sequence generation, and
poor models may never improve during training. One way to deal with exposure
bias is to use scheduled sampling during model training, where a model is first
pre-trained using maximum likelihood and then slowly shifted to its own predictions during
training [Ken+18].




Reinforcement learning offers a way to overcome these two limitations. By
incorporating a discrete metric like WER as a reward function, reinforcement learning
methods can avoid the train-test inconsistency. Since the state of an RL model is
given at each time step by the output state of the seq2seq decoder, which is driven by
the model's own previous predictions, exposure bias can be avoided.

Attention-based models have recently been shown to significantly outperform 
standard seq2seq models on a variety of tasks. However, they suffer from important 
limitations with large output spaces. In NLP, it is common to use smaller, truncated 
vocabularies to reduce computational burden. Attention-based models cannot
handle out-of-vocabulary (OOV) words. To overcome this, pointer-generator methods have
recently been proposed [SLM17]. These methods implement a switch mechanism
such that when an OOV word is predicted by the model output, the input word is
copied over to the output. Pointer-generator models are currently state-of-the-art
for several NLP tasks.


13.3.2 Deep Policy Gradient 


Deep policy gradient methods train a deep neural network to learn the optimal pol- 
icy. This can be accomplished with a seq2seq model where the output state of the 
decoder is used to represent the state of a model. The agent is thus modeled as the 
deep neural network (seq2seq model), where the output layer predicts a discrete ac- 
tion taken by this agent (Fig. 13.5). Policy gradient methods such as REINFORCE 
can be applied by choosing actions according to the deep neural network during 
training to generate sequences. The reward is observed at the end of the sequence 
or when an end-of-sequence (EOS) symbol is predicted. This reward can be a per- 
formance metric evaluated on the difference between the generated sequence and 
ground-truth sequence. 

Unfortunately, the algorithm must wait until the end of a sequence to update, 
causing high variance and making it slow to converge. Furthermore, at the start 
of training when the deep neural network is initialized randomly, early predicted 
actions might lead the model astray. Recent work suggests pre-training the policy
gradient model using cross-entropy loss before switching over to the REINFORCE
algorithm, a concept known as a warm start.


Algorithm 7: The seq2seq REINFORCE algorithm


input : Input sequences X, ground-truth output sequences Y
output: Optimal policy π*


while not converged do
    select batch from X and Y
    predict sequences of actions: [a_1, a_2, ..., a_N]
    observe rewards [r_1, r_2, ..., r_N]
    calculate baseline reward r_b
    calculate gradient and update the policy network
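
The last two steps of Algorithm 7 can be sketched as a surrogate loss whose gradient matches the baseline-corrected policy gradient. The function name `reinforce_loss`, the use of the batch mean as baseline, and the input layout (per-token log-probabilities for each sampled sequence) are assumptions; in practice the log-probabilities would come from the seq2seq decoder and the loss would be minimized by an automatic differentiation framework.

```python
def reinforce_loss(log_probs, rewards):
    """Surrogate loss for seq2seq REINFORCE with a batch-mean baseline.

    log_probs: list of per-token log-probabilities, one list per sampled sequence.
    rewards:   one sequence-level reward (e.g., a negated error rate) per sequence.
    """
    baseline = sum(rewards) / len(rewards)  # r_b: average reward over the batch
    loss = 0.0
    for seq_log_probs, r in zip(log_probs, rewards):
        # Sequences that beat the baseline get their log-likelihood pushed up.
        loss -= (r - baseline) * sum(seq_log_probs)
    return loss / len(rewards)


loss = reinforce_loss(
    log_probs=[[-1.0, -2.0], [-0.5, -0.5]],
    rewards=[1.0, 0.0],
)
print(loss)
```

Minimizing this quantity with respect to the policy parameters increases the likelihood of above-average sequences and decreases that of below-average ones.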

Fig. 13.5: DPG architecture


13.3.3 Deep Q-Learning 


Instead of learning an estimate of the policy directly, we can use deep neural net- 
works to approximate the action value function from which we can determine an 
optimal policy. These methods are commonly known as deep Q-learning, where we 
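learn to estimate a Q-function $Q(s,a;\theta)$ with parameters $\theta$ by minimizing the loss
function:


$$L(\theta) = \frac{1}{2}\left( r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta) \right)^2$$ (13.52)


Taking a gradient w.r.t. $\theta$, with the bootstrapped target treated as a constant, yields an update rule of the form:


$$\theta \leftarrow \theta + \alpha \underbrace{\left( r + \gamma \max_{a'} Q(s',a';\theta) - Q(s,a;\theta) \right)}_{\text{temporal difference}} \nabla_\theta Q(s,a;\theta)$$ (13.53)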


Unfortunately, the update rule has convergence issues and can be rather unstable, 
which limits the use of deep Q-learning models by themselves. 


13.3.3.1 DQN 


The deep Q-network (DQN) algorithm is a deep Q-learning model that utilizes
experience replay and target networks to overcome instability (Fig. 13.6) [Mni+13].
Some have attributed the launch of the field of deep reinforcement learning to the
introduction of the DQN algorithm in 2015 [HGS15]. Experience replay, as previously
stated, uses a memory buffer to store transitions, which are sampled in mini-batches during
training. This experience buffer helps to break correlations between transitions
and thereby stabilize learning.

Fig. 13.6: DQN architecture


A target network is an extra copy of the deep Q-network. Its weights $\theta_{target}$ are
periodically copied over from the original Q-network but remain fixed at all
other times. This target network is used to compute the temporal difference during
the update:


$$\theta \leftarrow \theta + \alpha \Big( r + \gamma \max_{a'} \underbrace{Q(s',a';\theta_{target})}_{\text{target network}} - Q(s,a;\theta) \Big) \nabla_\theta Q(s,a;\theta)$$ (13.54)


Together, experience replay and a target network effectively smooth out learning 
and avoid parameter oscillations or divergence. Typically a finite memory buffer of 
length M is used for experience replay, such that only the most recent M transi- 
tions are stored and sampled. Furthermore, experiences are uniformly sampled from 
the buffer, regardless of significance. More recently, prioritized experience replay 
has been proposed [Sch+15a], where more significant transitions are sampled more 
frequently based on TD error and importance sampling. 
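
A finite replay memory with uniform sampling can be sketched with the Python standard library. The class name `ReplayBuffer`, its method names, and the transition layout `(s, a, r, s_next, done)` are assumptions made for illustration; prioritized replay is not covered by this sketch.

```python
import random
from collections import deque


class ReplayBuffer:
    """Keeps only the most recent `capacity` transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # old items fall off automatically
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement, regardless of significance.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buf.add(i, 0, 0.0, i + 1, False)
batch = buf.sample(2)
print(len(buf), batch)
```

After five insertions into a buffer of capacity three, only the transitions starting in states 2, 3, and 4 remain available for sampling.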

Algorithm 8: Seq2Seq DQN algorithm
input : Input sequences X, ground-truth output sequences Y
output: Optimal Q function Q*

Initialize seq2seq model π_θ
Initialize Q network parameters θ
Initialize target Q network parameters θ_target
Initialize replay memory

while not converged do
    select batch from X and Y
    sample sequences of actions from seq2seq model: [a_1, a_2, ..., a_N]
    collect experience (s_t, a_t, r_t, s'_t) and add to replay memory

    select mini-batch from replay memory
    for each mini-batch sample do
        estimate current Q value using Q network
        estimate next best action Q value using target Q network
        save estimates to buffer

    update Q network parameters θ by minimizing Q network loss with mini-batch estimates
    update seq2seq model π_θ with gradient based on estimated Q values
    every K steps, copy over weights to target network: θ_target ← θ


13.3.3.2 Double DQN


DQN methods suffer from a fundamental tendency to overestimate
Q-values. To see this, consider that the following relation holds:


$$\max_{a'} Q(s',a';\theta_{target}) = Q\left( s', \arg\max_{a'} Q(s',a';\theta_{target}); \theta_{target} \right)$$ (13.55)


Using this, we can rewrite the DQN loss function as:


$$L(\theta) = \frac{1}{2}\left( r + \gamma Q\left( s', \arg\max_{a'} Q(s',a';\theta_{target}); \theta_{target} \right) - Q(s,a;\theta) \right)^2$$ (13.56)


In this expression, it can be seen that the target network is used twice: first to choose
the next best action, and then to estimate the Q value of this action. As a result, 
there is a tendency to overestimate Q-values. Double Deep Q-Learning networks 
overcome this by using two separate target networks: one to select the next best 
action, and the other to estimate Q-values given the action selected. 

Rather than introducing another target network, Double Deep Q-Networks
(DDQN) use the current Q-network to select the next best action and the target
network to estimate its Q-value. The DDQN loss function can be written as:


$$L(\theta) = \frac{1}{2}\left( r + \gamma Q\left( s', \arg\max_{a'} Q(s',a';\theta); \theta_{target} \right) - Q(s,a;\theta) \right)^2$$ (13.57)

DDQN thus avoids the need for a third network, as used in Double Deep Q-Learning,
to resolve the overestimation problem.
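
The difference between the targets inside Eqs. (13.56) and (13.57) can be made concrete with plain lists of Q-values. The function names and the example numbers are hypothetical; `q_online_next` and `q_target_next` stand for the outputs of the current and target networks at the next state $s'$.

```python
def dqn_target(r, gamma, q_target_next, done):
    """Target of Eq. (13.56): the target network both selects and evaluates."""
    if done:
        return r
    return r + gamma * max(q_target_next)


def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    """Target of Eq. (13.57): the online network selects the action,
    the target network evaluates it."""
    if done:
        return r
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best]


q_online_next = [1.0, 2.0, 0.5]   # online network prefers action 1
q_target_next = [1.5, 1.0, 3.0]   # target network prefers action 2

y_dqn = dqn_target(1.0, 0.5, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.5, q_online_next, q_target_next, done=False)
print(y_dqn, y_ddqn)
```

When the two networks disagree about the best action, the DDQN target is never larger than the DQN target, which is how the decoupling dampens overestimation.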


13.3.3.3 Dueling Networks 


DQN and DDQN methods are useful when the action space is small. In NLP appli- 
cations, however, the action space can be equal to the size of the vocabulary, even 
though only a small subset might be feasible at any one time. Estimating the Q-value 
of each action in such a large space can be prohibitively expensive and slow to con- 
verge. Consider the fact that in some states, the choice of action may have little to 
no effect, while in other states, choice of action might be life-or-death. 

The dueling network method uses a single network to simultaneously predict 
both a state value function and advantage function that are aggregated to estimate 
the Q-function. By doing so, it avoids the need to estimate the value of each ac- 
tion choice. In one possible design, the dueling network is based on a Q-network 
architecture with CNN lower layers, followed by two separate fully connected layer 
streams whose outputs are summed together to estimate the Q-value. 
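
The aggregation of the two streams can be sketched as follows. The text above describes the outputs as summed; subtracting the mean advantage before summing, as done here, is an assumption based on a commonly used variant of dueling networks, and the function name `dueling_q` is hypothetical.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value with per-action advantages into Q-values.

    The mean advantage is subtracted (an assumed variant) so that the split
    between the value and advantage streams is uniquely determined.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


q1 = dueling_q(2.0, [1.0, 0.0, -1.0])
q2 = dueling_q(2.0, [2.0, 1.0, 0.0])  # same advantages shifted by a constant
print(q1, q2)
```

Both calls give the same Q-values, which illustrates why a constant offset in the advantage stream carries no information and can be removed.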


Algorithm 9: Seq2Seq double DQN algorithm

input : Input sequences X, ground-truth output sequences Y
output: Optimal Q function Q*

Initialize seq2seq model π_θ
Initialize Q network parameters θ
Initialize target Q network parameters θ_target
Initialize replay memory

while not converged do
    select batch from X and Y
    sample sequences of actions from seq2seq model: [a_1, a_2, ..., a_N]
    collect experience (s_t, a_t, r_t, s'_t) and add to replay memory
    select mini-batch from replay memory

    for each mini-batch sample do
        estimate current Q value using Q network
        select next best action using Q network
        estimate sample Q using target Q network
        save estimates to buffer

    update Q network parameters θ by minimizing Q network loss with mini-batch estimates
    update seq2seq model π_θ with gradient based on estimated Q values
    every K steps, copy over weights to target network: θ_target ← θ


13.3.4 Deep Advantage Actor-Critic


We have seen that the addition of a separate target network to deep Q-learning meth- 
ods can help overcome high variance and overestimation. Recall in DDQN that we 
use the current network to select an action, and the target network to evaluate the 
action. In effect, the current network serves as the actor and the target network as 
the critic, with the caveat that the two networks are identical in architecture and the 
weights of the target network are periodically synchronized to the current network. 

This need not be the case, as a different network can be trained to estimate the 
value function and act as a critic. Since deep neural networks tend to be unstable 
estimators of the state value function, deep actor-critic methods usually focus on 
estimating and maximizing the advantage function. 

Instead of computing the advantage function directly as the difference between the
Q-function and the state value function, we can use the TD error:


$$\delta = r_t + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t) \tag{13.58}$$

since it can be proven that:

$$\mathbb{E}[\delta] = Q_{\pi_\theta}(s_t, a_t) - V_{\pi_\theta}(s_t) \tag{13.59}$$


This value network method is known as deep advantage actor-critic (Fig. 13.7). In 
this case, only a single Q network is necessary, though for stability reasons it is best 
trained with experience replay and a target network similar to DQN. 
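
As a small illustration of using the TD error as an advantage estimate, the sketch below computes it for every step of one finished episode. The helper names and the convention that the value after the final step is zero are assumptions for the example.

```python
def td_error(reward, value_s, value_next, gamma=0.99, terminal=False):
    """TD error: reward + gamma * V(next state) - V(current state)."""
    bootstrap = 0.0 if terminal else gamma * value_next
    return reward + bootstrap - value_s


def td_errors_for_episode(rewards, values, gamma=0.99):
    """Advantage estimates for every step of one finished episode.

    values[t] is the critic's estimate V(s_t); the state after the last
    step is treated as terminal, so its value is taken to be zero.
    """
    deltas = []
    for t, reward in enumerate(rewards):
        is_last = t == len(rewards) - 1
        next_value = 0.0 if is_last else values[t + 1]
        deltas.append(td_error(reward, values[t], next_value, gamma, is_last))
    return deltas


advantages = td_errors_for_episode(
    rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], gamma=1.0
)
```

Only the critic's state values are needed here, which is why the method avoids estimating a separate value for every action.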

Fig. 13.7: Deep Advantage Actor-Critic architecture


Algorithm 10: Seq2Seq AC algorithm with experience replay

input : Input sequences X, ground-truth output sequences Y
output: Optimal policy $\pi^*$

Initialize actor (seq2seq) network $\pi_\theta$
Initialize critic network $\theta$
Initialize replay memory

while not converged do
    select batch from X and Y
    sample sequences of actions from actor: $[a_1, a_2, \ldots, a_n]$
    calculate true discounted rewards: $[r_1, r_2, \ldots, r_n]$
    collect experience $(a_t, r_t)$ and add to replay memory
    sample mini-batch from replay memory
    for each mini-batch sample do
        compute advantage estimates from the critic network
    update critic Q network parameters $\theta$ by minimizing critic loss over mini-batch
    update actor parameters $\pi_\theta$ with gradient based on advantage estimates from critic


13.4 DRL for Text 


Deep reinforcement learning methods have been recently applied to a variety of 
natural language processing tasks on text. In particular, they have been very suc- 
cessful in building conversational agents and dialogue systems. In the next sections, 
we provide a survey of different DRL methods for information extraction, text clas- 
sification, dialogue systems, text summarization, machine translation, and natural 
language generation. Many of these are based on leveraging seq2seq models, used
either to generate embeddings or as models of the target policy. This is not to say
that DRL methods are restricted to seq2seq models, as CNNs can also be successfully
applied.


13.4.1 Information Extraction 


Information extraction is defined as the task of automatically extracting entities, 
relations, and events from text. In recent years, researchers have successfully ap- 
plied deep learning methods to entity extraction, including architectures that lever- 
age CNNs and RNNs [Qi+14, GHS16]. In real domains, however, it takes very 
large amounts of labeled data to learn to perform high quality extraction. Further- 
more, relation extraction quality depends on the results of entity extraction (and 
vice versa). It may also be that we care about only a subset of relations, such as in 
action task extraction. DRL methods have found applicability in addressing these 
considerations. 

For large-scale domains, labeled training data is often the largest constraint to
performance, as it can be prohibitively expensive to obtain accurately labeled data. 
Distant supervision is one method that seeks to alleviate this need by leveraging an 
external knowledge graph to automatically align text to extract entities or relations 
[Min+09b]. However, extraction generated in this manner is not directly labeled 
and can be incomplete. This is where reinforcement learning can be helpful. 


13.4.1.1 Entity Extraction 


For entity extraction tasks, external information can be used to resolve ambiguities 
and boost accuracy by querying similar documents and comparing extracted entities. 
This is a sequential task that can be addressed with a reinforcement learning agent 
where we model the extraction task as a Markov decision process. 

Figure 13.8 is an example of the architecture proposed by K. Narasimhan et 
al. [NYB16] based on a DQN agent. In this model, the states are real-valued vectors 
that encode the matches, context, and confidence of extracted entities from the target 
and query documents. The actions are to accept, reject, or reconcile the entities of 
two documents and query the next document. The reward function is selected to 
maximize the final extraction accuracy: 


$$R(s, a) = \sum_{\text{entity } j} \text{Acc}(\text{entity}_{target}(j)) - \text{Acc}(\text{entity}_{query}(j)) \tag{13.60}$$


To minimize the number of queries, a negative reward is added to each step. Since 
this model is based on a continuous state space, the DQN algorithm can be trained 
to approximate the Q-function, where the parameters of the DQN are learned using 
stochastic gradient descent with experience replay and a target network to reduce 
variance. 

Fig. 13.8: Entity extraction with DQN


13.4.1.2 Relation Extraction


Consider a deep learning network for the task of relation extraction. This network is 
regarded as the DRL agent, whose role is to take as input a sequence of words in a 
sentence, and whose output is the set of extracted relations. If the sentences are regarded
as states and the relations as actions, we can learn an optimal policy to perform rela- 
tion extraction. The process of extracting relations from a bag of sentences becomes 
an episode. 

Figure 13.9 depicts a deep policy gradient approach to this relation extraction 
task. The reward function is given by the accuracy of the predicted relations in 
a bag in comparison with a set of gold labels. The REINFORCE algorithm has 
been applied [Zen+18] to optimize the policy of this model by defining the reward 
function of a state $s_i$ to be:

$$R(s_i) = \gamma^{n-i} r_n \tag{13.61}$$


where $n$ is the number of sentences in the bag and $r_n$ is either $+1$ or $-1$. The objective
function for the policy gradient method is:


$$J(\theta) = \mathbb{E}_{s_1, s_2, \ldots, s_n} R(s_i) \tag{13.62}$$


This leads to a gradient update of the form: 


$$\theta \leftarrow \theta + \nabla J(\theta), \qquad \nabla J(\theta) = \sum_{i=1} \sum_{j=1} \nabla p(a_i \mid s_i; \theta) \left( R(s_i) - r_b \right) \tag{13.63}$$


where the baseline $r_b$ is given by:

Fig. 13.9: Relation extraction with DPG


13.4.1.3 Action Extraction


The task of extracting action sequences from text is challenging in that they usually 
are highly affected by context. Traditional methods depend on a set of templates 
which do not generalize well to natural language. Sequence labeling methods do 
not perform well, since there is only a subset of sequences that can be considered 
meaningful actions. The action extractor can be modeled as a DRL agent, where the 
states are regarded as word sequences, and actions are the set of labels associated 
with the word sequence. This agent can learn an optimal labeling policy by training 
a DQN model. Figure 13.10 shows the architecture proposed by Feng et al. [FZK18] 
called EASDRL that is based on first extracting action names and then the action 
targets. To do so, this architecture defines two Q-functions associated with separate 
CNN networks for modeling the action name Q(s,a) and action target Q(S,a), and is 
trained using a variant of experience replay that weighs positive-reward transitions 
higher. 

Fig. 13.10: Action extraction with DQN


13.4.1.4 Joint Entity/Relation Extraction 


Usually, entity extraction occurs as a precursor to relation extraction. They can be 
considered interdependent tasks, since the quality of relation extraction usually de- 
pends on the quality of extracted entities. Given this sequential nature, it is possible 
to use reinforcement learning to jointly learn and optimize for both tasks concurrently.
Figure 13.11 shows a DRL architecture [Fen+17] based on a deep Q-learning
agent. In this model, the current state $s$ is the entity extractor output from a Bi-LSTM
with attention $Att(X; \theta_1)$, and the transition state $s'$ is the relation extraction output
from a Tree-LSTM $Tree(X; \theta_2)$. The actions are defined over the set $(a_1, a_2, a_3, a_4)$,
where $a_1$ and $a_2$ classify the existence of a relation mention, and $a_3$ and $a_4$ classify
the type of relation mention. In other words, the DRL agent combines the tasks
of entity extraction, relation mention classification, and relation classification. The
DQL model is trained using stochastic gradient descent.

Fig. 13.11: Joint entity/relation extraction with DQL


13.4.2 Text Classification 


Deep learning for text classification has mainly focused on learning representations 
of words and sentences that can effectively capture semantic context and structure. 
Current methods, however, are unable to automatically learn and optimize structure, 
since they are trained explicitly using supervised input or treebank annotations. In 
contrast, DRL can be used to build hierarchical-structure sentence representations 
without the need for annotations. 

Figure 13.12 shows an architecture that consists of three components: a policy 
network, a representation model, and a classification network [ZHZ18]. The policy 
network is based on a stochastic policy whose states are vector representations
of both word-level and phrase-level structure. These vectors are the output of the
representation model, which consists of a two-level hierarchical LSTM that connects
a sequence of words to form a phrase and a sequence of phrases to form a sentence
representation. The actions of the policy network label whether a word is inside or
at the end of a phrase. Whereas the policy network focuses on building sentence 
representations that capture structure, the classification network takes the output 
from the representation model and uses it to perform the classification task. 

To jointly train the policy and classification networks, the hierarchical LSTM is 
first initialized and pre-trained using the cross-entropy loss of the classifier network, 
given by: 


$$L = -\sum_{X \in D} \sum_{y=1}^{K} p(y, X) \log P(y \mid X) \tag{13.65}$$

where $p$ and $P$ are the target and predicted distributions, respectively. The parameters
for the representation model and classifier networks are then held constant and
the policy network is pre-trained using the REINFORCE algorithm. After the warm 
start, all three networks are jointly trained until convergence. 

Fig. 13.12: Text classification with DPG


13.4.3 Dialogue Systems 


Dialogue systems have become increasingly popular as chatbots gain widespread 
application across social media and customer service. Developing an intelligent 
dialogue system has always been a major goal of AI, dating back to the Turing 
Test. Dialogue agents must perform a pipeline of multiple tasks, including natural 
language understanding, state tracking, dialogue policy, and natural language gen- 
eration. Dialogue systems have been modeled successfully as partially observable 
Markov decision processes. 

Slot-filling dialogues are an important subclass of dialogue systems that involve 
filling-in a set of predefined slots in response to user dialogue and context. In these 
systems, the relationship between a chatbot and user is analogous to an RL agent 
and its environment. Conversational dialogue becomes an optimal decision mak- 
ing problem, where the reward function can be defined as a successful interaction 
between chatbot and user. 

There are several fundamental problems with dialogue systems. The biggest issue
is the credit assignment problem, where error propagation through the pipeline
may make it nearly impossible to determine which component is the source of an error.
For instance, a poorly performing dialogue policy may be due to incorrect state tracking or
low-quality NLU. Similarly, the reliance of downstream components on upstream
tasks makes optimization particularly difficult. For instance, a tweak to the state 
tracker may lead to sub-optimal dialogue policy. In an ideal case, the entire pipeline 
is trained at once in an end-to-end manner. For these reasons, deep RL methods are 
finding significant use for modeling dialogue systems. 

A DQN agent has been successfully applied to train a dialogue system [ZE16,
GGL18] that unifies state tracking and dialogue policy and treats both as actions 
available to the RL agent. The architecture learns an optimal policy that generates a 
verbal response or updates the current dialogue state. Figure 13.13 depicts the DQN 
model, which uses an LSTM network to generate a dialogue state representation. 
The output of the LSTM serves as input to a set of policy networks in the form 
of multilayer perceptron networks representing each possible action. The output of 
these networks represents the action-state value functions for each action. 

Fig. 13.13: Dialogue system with DQN


Due to the high-dimensional state and action spaces, a large number of labeled 
dialogues are typically required to train dialogue systems. To overcome this need for 
training data, a two-stage deep RL method has been proposed [Fat+16] that uses an
actor-critic architecture where the policy network is first trained in a supervised manner on a
small number of high-quality dialogues via categorical cross-entropy to bootstrap
learning. The value network can then be trained using the deep advantage actor- 
critic method. 


13.4.4 Text Summarization 


Text summarization is an interesting NLP task that seeks to automatically generate 
natural language summaries of input text in human-readable form. It has widespread 
use across a variety of industries and comes in two categories: extractive and ab- 
stractive summarization. In the extractive case, it seeks to eliminate superfluous text 
and keep only the most relevant words while maintaining natural language form. 
In the abstractive case, it seeks to provide a paraphrased summary of the relevant
points in the text. 

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is the standard
quality measure most often used for text summarization tasks. By definition,
ROUGE-1 measures the unigrams that are shared between a predicted summary
and the ground-truth reference text. ROUGE-2 measures the bigrams that
are shared, and ROUGE-L measures the longest common subsequence (LCS) between
prediction and ground truth. For each of these measures, precision and recall are
typically quoted. The problem with ROUGE is that it provides little information
about the human readability of predictions, which is usually captured by a measure
like perplexity under a language model.
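
To make the ROUGE-L definition concrete, the following sketch computes it with a standard dynamic program for the longest common subsequence. It assumes whitespace tokenization and an equally weighted F1 score; the official ROUGE package applies its own tokenization and weighting, so the numbers are illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Return (precision, recall, F1) of ROUGE-L on whitespace tokens."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge_l("the cat sat on the mat", "the cat lay on a mat")
```

The shared subsequence here is "the cat on mat", so four of the six tokens on each side are matched even though they are not contiguous.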

DQN has been successfully applied to the task of extractive text summarization 
[LL17, PXS17b, Ce+18b]. Figure 13.14 shows the architecture where states denote 
the current (partial) text summary, actions denote adding a sentence to this summary, 
and ROUGE is used as the reward. In this architecture, a sentence is represented as 
a concatenation of a document vector (DocVec), sentence vector (SentVec), and 
position vector (PosVec). 

Fig. 13.14: Text summarization with DQN


Attention-based deep learning networks have found significant traction in ab- 
stractive text summarization tasks. But despite their high ROUGE scores, they often 
generate unnatural summaries. This has opened the door to deep RL methods that 
can incorporate a mixed training objective: 


$$L_{mixed} = \alpha L_{pg} + (1 - \alpha) L_{ml} \tag{13.66}$$

that incorporates both the teacher-forcing maximum likelihood function:


$$L_{ml} = -\sum_{t=1}^{n} \log p(y^*_t \mid y^*_1, y^*_2, \ldots, y^*_{t-1}, x) \tag{13.67}$$


and a policy gradient objective:


$$L_{pg} = -\left[ r - r_b \right] \sum_{t=1}^{n} \log p(y_t \mid y_1, y_2, \ldots, y_{t-1}, x) \tag{13.68}$$


where the reward r is a discrete objective like ROUGE. 
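
A minimal numeric sketch of the mixed objective follows. It works directly on per-token probabilities rather than on a real seq2seq model, and the names `mixed_loss` and `alpha` (for the mixing weight) are assumptions made for the illustration.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of a sequence from its per-token probabilities."""
    return -sum(math.log(p) for p in token_probs)


def mixed_loss(gold_token_probs, sampled_token_probs, reward, baseline, alpha):
    """Weighted sum of a policy-gradient loss and a maximum-likelihood loss.

    gold_token_probs: model probabilities of the ground-truth tokens
        under teacher forcing.
    sampled_token_probs: model probabilities of the tokens of a sampled summary.
    reward / baseline: e.g. ROUGE of the sampled summary and of a baseline
        summary; both are treated as constants, not differentiated through.
    """
    loss_ml = sequence_nll(gold_token_probs)
    # -(r - r_b) * sum_t log p(y_t | ...); sequence_nll already carries the minus sign.
    loss_pg = (reward - baseline) * sequence_nll(sampled_token_probs)
    return alpha * loss_pg + (1.0 - alpha) * loss_ml


loss = mixed_loss(
    gold_token_probs=[0.5, 0.5],
    sampled_token_probs=[0.25],
    reward=0.6,
    baseline=0.4,
    alpha=0.5,
)
```

When the sampled summary scores above the baseline, minimizing the policy-gradient term raises the probability of the sampled tokens; when it scores below, the sign flips and those tokens are discouraged.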


13.4.5 Machine Translation 


One of the recent breakthroughs in neural machine translation has been the use of 
seq2seq models. As noted above, teacher-forcing is the primary method to train 
these networks. These models exhibit exposure bias during prediction time. Fur- 
thermore, decoders cannot generate target sequences of interest with specific objec- 
tives. This is especially so if beam-search is employed, which tends to focus more 
on short-term rewards, a concept termed myopic bias. Machine translation is most 
often evaluated based on the discrete BLEU measure, which creates a train-test mis- 
match. 

Deep RL models have been proposed to overcome some of these shortfalls. A 
deep PG model [Li+16] based on the REINFORCE training algorithm can address 
the non-differentiable nature of the BLEU metric. However, REINFORCE suffers 
from the inability to learn policies in large action spaces as is the case with language 
translation. 

More recently, an actor-critic model has been proposed by using a decoding 
scheme that incorporates longer-term rewards through a value function estimate 
[Bah+16a]. In this model, the main sequence prediction model is the actor/agent
and the value function acts as a critic. The current sequence prediction output is the 
state, and candidate tokens are actions of the agent. The critic is implemented by a 
separate RNN and is trained on the ground-truth output using temporal difference 
methods, with a target critic used to reduce variance. 


13.5 DRL for Speech 


Deep neural networks have significantly improved the performance of speech recognition
systems. When they are used as part of a hybrid system together
with GMMs or HMMs, alignment of the acoustic model is a necessity during training.
This can be avoided when deep neural networks are used in end-to-end systems
that learn transcriptions by directly maximizing the likelihood of the input data
[YL18]. Such systems, while currently leading state-of-the-art performance, still
suffer from a variety of limitations.

Drawing from the experiences with text, researchers and practitioners have begun 
to apply deep reinforcement learning methods to speech and audio, including tasks 
such as automatic speech recognition, speech enhancement, and noise suppression. 
In the near future, we expect to see wider adoption of deep RL techniques in other 
aspects of speech, including applications in speaker diarization, speaker tone detec- 
tion, and stress analysis. 


13.5.1 Automatic Speech Recognition 


The task of automatic speech recognition (ASR) is in many ways similar to machine 
translation. ASR most often uses CTC maximum likelihood learning while measur- 
ing performance with a discrete measure like word-error rate (WER). As a result, 
train-test mismatch is a problem. Furthermore, as a sequence prediction task, ASR 
suffers from exposure bias since it will be trained on ground-truth labels that are not 
available at prediction time. 

A deep RL approach using policy gradients has been shown [ZXS17] to be effective
in overcoming these limitations (Fig. 13.15). In this approach, the ASR
model is regarded as the agent, and training samples as the environment. The policy
$\pi_\theta(y \mid x)$ is parameterized by $\theta$, the actions are considered to be generated transcriptions,
and the model state is the hidden data representation. The reward function is
taken to be WER. The policy gradient is updated by the rule:


$$\theta \leftarrow \theta + \alpha \nabla_\theta \log P_\theta(y \mid x) [r - r_b] \tag{13.69}$$
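
The update rule can be illustrated on a toy softmax policy over a handful of candidate transcriptions. This is a sketch under stated assumptions: the policy is a plain softmax over logits rather than an ASR model, the reward is taken to be negative WER so that fewer errors give a higher reward, and `reinforce_step` is a hypothetical helper name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, baseline, lr=0.1):
    """One update: theta <- theta + lr * grad log p(action) * (reward - baseline).

    For a softmax policy, the gradient of log p(action) with respect to the
    logits is one_hot(action) - probabilities.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        theta + lr * advantage * ((1.0 if k == action else 0.0) - probs[k])
        for k, theta in enumerate(logits)
    ]


# Toy policy over three candidate transcriptions. The sampled candidate
# (index 1) has a WER of 0.1 against a baseline WER of 0.4.
theta = [0.0, 0.0, 0.0]
theta = reinforce_step(theta, action=1, reward=-0.1, baseline=-0.4)
```

Because the sampled transcription beats the baseline, its logit increases and the other two decrease by the same total amount.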


13.5.2 Speech Enhancement and Noise Suppression 


Machine learning speech enhancement methods have been in existence for quite a 
while. Enhancement techniques usually fall under four subtasks: voice-activity de- 
tection, signal-to-noise estimation, noise suppression, and signal amplification. The 
first two provide statistics on the target speech signal while the latter two use these 
statistics to extract the target signal. This can be naturally thought of as a sequential
task. A deep RL method based on policy gradients has been proposed [TSN17]
for the task of speech enhancement with an architecture that is based on using an
LSTM network to model a filter whose parameters $\theta$ are determined by a learned
policy $\pi_\theta$. In this model, the filter is the agent, the state is a set of filter parameters,
and actions are increases or decreases in a filter parameter. The reward function
measures the mean-square error between the filter output and a ground-truth clean
signal sequence. This policy gradient model, trained using the REINFORCE algorithm,
can improve signal-to-noise ratio with no algorithmic changes to the baseline
speech-enhancement process. Furthermore, by incorporating a deep reinforcement
learning agent, the filter can adjust to changing underlying conditions through dynamic
parameter adaptation.

Fig. 13.15: Automatic speech recognition with DPG


13.6 Case Study 


In this case study, we will apply the deep reinforcement learning concepts of this 
chapter to the task of text summarization. We will use the Cornell NewsRoom Sum- 
marization dataset. The goal here is to show readers how we can use deep reinforce- 
ment learning algorithms to train an agent that can learn to generate summaries of 
these articles. For the case study, we will focus on deep policy gradient and double 
deep Q-network agents. 


13.6.1 Software Tools and Libraries 


We will use the following packages in this case study: 


• TensorFlow is an open-source software library for dataflow programming across
a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks. It is used for both research and
production at Google.

• RLSeq2Seq is an open-source library which implements various RL techniques
for text summarization using sequence-to-sequence models.

• pyrouge is a Python interface to the Perl-based ROUGE-1.5.5 package that computes
ROUGE scores of text summaries.


13.6.2 Text Summarization 


To measure the performance of machine generated summaries, we will use ROUGE, 
which stands for Recall-Oriented Understudy for Gisting Evaluation. It is a set of 
metrics used to evaluate automatic summarization of texts as well as machine trans- 
lation. It works by comparing an automatically produced summary or translation 
against a set of reference summaries (typically human-produced). 

ROUGE-N, ROUGE-S, and ROUGE-L are measures of the granularity of texts 
when comparing between the system predicted summaries and reference summaries. 
For example, ROUGE-1 refers to overlap of unigrams between the system summary 
and reference summary. ROUGE-2 refers to the overlap of bigrams between the 
system and reference summaries. Suppose we
want to compute the ROUGE-2 precision and recall scores. For ROUGE, recall is
a measure of how much of the reference summary is captured by the system
summary, while precision is a measure of how much of the system summary also
appears in the reference summary.
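
As a worked illustration of ROUGE-2 precision and recall, the sketch below counts clipped bigram overlap between two short sentences. The sentence pair is invented for the example, and whitespace tokenization is an assumption; the case study itself relies on pyrouge for the reported scores.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(system, reference, n=2):
    """Return (precision, recall) of clipped n-gram overlap."""
    sys_ngrams = ngrams(system.split(), n)
    ref_ngrams = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((sys_ngrams & ref_ngrams).values())
    precision = overlap / max(sum(sys_ngrams.values()), 1)
    recall = overlap / max(sum(ref_ngrams.values()), 1)
    return precision, recall


precision, recall = rouge_n(
    system="the cat was found under the bed",
    reference="the cat was under the bed",
    n=2,
)
```

The system summary has six bigrams and the reference has five, with four in common, so precision is 4/6 and recall is 4/5.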


13.6.3 Exploratory Data Analysis 


The Cornell Newsroom dataset consists of 1.3 million articles and summaries writ- 
ten by news authors and editors from 38 major publications between 1998 and 2017. 
The dataset is split into train, dev, and test sets of 1.1m, 100k, and 100k samples. 
A sample of the dataset is provided below: 


Story: Coinciding with Mary Shelley’s birthday week, this Scott family affair produced by 
Ridley for director son Luke is another runout for the old story about scientists who cre- 
ate new life only to see it lurch bloodily away from them. Frosty risk assessor Kate Mara’s 
investigations into the mishandling of the eponymous hybrid intelligence (The Witch’s still- 
eerie Anya Taylor-Joy) permits Scott Jr a good hour of existential unease: is it the placid 
Morgan or her intemperate human overseers (Toby Jones, Michelle Yeoh, Paul Giamatti) 
who pose the greater threat to this shadowy corporation’s safe operation? Alas, once that 
question is resolved, the film turns into a passably schlocky runaround, bound for a guess- 
able last-minute twist that has an obvious precedent in the Scott canon. The capable cast 
yank us through the chicanery, making welcome gestures towards a number of science- 
fiction ideas, but cranked-up Frankenstein isn’t one of the film’s smarter or more original 
ones. 

Summary: Ridley and son Luke turn in a passable sci-fi thriller, but the horror turns to
shlock as the film heads for a predictable twist ending 


For our case study, we will use subsets of 10,000/1000/1000 articles and sum- 
maries from the Cornell Newsroom dataset for our training, validation, and test sets, 
respectively. We will tokenize and map these data sets using 100-dim embeddings 
generated with word2vec. For memory considerations, we limit our vocabulary to 
50,000 words. 
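
The vocabulary cap can be sketched as follows. This is a simplified illustration, not the preprocessing code of the case study: it assumes lowercased whitespace tokenization, uses hypothetical `<pad>` and `<unk>` special tokens, and leaves out the word2vec embedding lookup.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # special tokens (assumed names)


def build_vocab(texts, max_size=50_000):
    """Map the most frequent tokens to integer ids, capped at max_size entries."""
    counts = Counter(token for text in texts for token in text.lower().split())
    vocab = {PAD: 0, UNK: 1}
    for token, _ in counts.most_common(max_size - len(vocab)):
        vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to ids, sending out-of-vocabulary tokens to UNK."""
    return [vocab.get(token, vocab[UNK]) for token in text.lower().split()]


# Tiny cap of 4 entries to show how rare words fall back to UNK.
vocab = build_vocab(["the zoo has reopened", "the zoo flooded"], max_size=4)
ids = encode("the zoo has reopened", vocab)
```

With the cap set to four entries, only the two most frequent words receive their own ids and the remaining words share the unknown-token id.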


13.6.3.1 Seq2Seq Model 


Our first task is to train a deep policy gradient agent that can produce summaries
of the articles. Before we do so, we pre-train the seq2seq model using maximum
likelihood loss, an encoder and decoder layer size of 256, a batch size of 20, and
Adagrad with gradient clipping for 10 epochs (Fig. 13.16). After pre-training, we
evaluate this model on the test set to find the results shown in Table 13.1.


Table 13.1: ROUGE metrics for Seq2Seq trained on MLE

          F-score   Precision   Recall
ROUGE-1   15.6      20.6        14.5
ROUGE-2   1.3       1.6         1.3
ROUGE-L   14.3      19.0        13.3





Seq2seq: at 90-years old this tortoise has never moved better despite a horrific rat attack 
that caused legs 

Reference: a 90-year old tortoise was given wheels after a rat attack caused her to lose her 
front legs 


Seq2seq: a city employee in baquba the capital of diyala province vividly described 
his ambivalence 

Reference: iraqis want nothing more than to have u.s. soldiers leave iraq but there is 
nothing they can less afford 


Seq2seq: google reported weaker than expected results thursday for its latest quarter 


Reference: the tech giant ’s shares rose after it reported a smaller than expected rise in 
sales for its latest quarter 


In comparison with the reference summaries, the generated summaries are fair 
but leave some room for improvement. 


13.6.3.2 Policy Gradient 


Let’s apply a deep policy gradient algorithm to improve our summaries (Fig. 13.17).
We set our reward function to the ROUGE-L F1 score and switch from MLE loss to RL
loss. We continue training for 8 epochs, after which we evaluate the RL-trained
model on the test set to find the results shown in Table 13.2.

Fig. 13.16: Seq2Seq model for text summarization


Table 13.2: ROUGE metrics for DPG

          F-score   Precision   Recall
ROUGE-1   22.4      19.6
ROUGE-2   8.5
ROUGE-L   17.6      15.5        28.0





As we increase training, we expect to see the generated summaries become even
more closely aligned with human-generated language.


DPG: apple has disclosed the details of a streaming music service plan to recording com- 
panies sources say 

Reference: apple executives have spoken to the top four recording companies about plans 
to offer a streaming music service free of charge to consumers multiple music industry 
sources told cnet 


DPG: conservative pundit glenn beck says the obama administration is using churches 
and other faith based groups to promote its climate change agenda 
Reference: glenn beck says obama uses churches on climate change green house 


DPG: the zoo in georgia ’s capital has reopened three months after a devastating flood 
that killed more than half of its 600 animals including about 20 tigers lions and jaguars 
Reference: a georgia zoo that had half its animals killed during floods in june has reopened 

Fig. 13.17: Deep policy gradient for text summarization


13.6.3.3 DDQN 


Let’s see if we can improve on the results above using a double deep Q-learning
agent. We start as before by pre-training the seq2seq language model using maximum
likelihood loss for 10 epochs. We then train the double deep Q-network for 8
epochs using a batch size of 20 and a replay buffer of 5000 samples, updating the target
network every 500 iterations. For better results, we first pre-train the DDQN
agent with a fixed actor for a single epoch. When we then evaluate the resulting
model on the test set, we find the results shown in Table 13.3.
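
The replay buffer and target-update schedule described above can be sketched as follows. The class and function names are hypothetical and the transitions are placeholder tuples; the sketch only shows the bookkeeping (capacity of 5000, batches of 20, synchronization every 500 iterations), not the RLSeq2Seq implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted first."""

    def __init__(self, capacity=5000):
        self.memory = deque(maxlen=capacity)

    def add(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size=20):
        """Draw a uniform random mini-batch without replacement."""
        return random.sample(list(self.memory), min(batch_size, len(self.memory)))


def should_sync_target(step, every=500):
    """True on the iterations where target weights are copied from the Q network."""
    return step > 0 and step % every == 0


buffer = ReplayBuffer(capacity=5000)
for i in range(6000):
    buffer.add((f"state_{i}", "action", 0.0, f"state_{i + 1}"))
batch = buffer.sample(batch_size=20)
sync_steps = [s for s in range(1, 1501) if should_sync_target(s)]
```

After 6000 insertions the buffer holds only the most recent 5000 transitions, which keeps memory bounded while still decorrelating the samples in each mini-batch.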


Table 13.3: ROUGE metrics for DDQN

          F-score   Precision   Recall
ROUGE-1   34.6      28.8
ROUGE-2   21.4      19.0        31.1
ROUGE-L   30.4      23.?        47.7


DDQN: the commander of us forces in the middle east said that the refusal to follow orders 
occurred during the battle for the recently liberated town of manbij syria 

Reference: a top us general said tuesday that isis fighters defied their leader ’s orders to 
fight to the death in a recent battle instead retreating to the north 





DDQN: an online discussion of the washington area rental market featuring post columnist 
sara gebhardt 

Reference: welcome to apartment life an online discussion of the washington area rental 
market featuring post columnist sara gebhardt 

DDQN: albania has become the largest producer of outdoor grown cannabis in europe
Reference: albania has become the largest producer of outdoor grown cannabis in europe 


The DDQN agent outperforms the DPG agent for the chosen parameters. There 
are a myriad of possibilities to improve results further—we could use scheduled 
or prioritized sampling, intermediate rewards, and/or some form of attention at the 
encoder or decoder. 


13.6.4 Exercises for Readers and Practitioners 


1. How would you combine a DQN agent for the task of text classification when 
using a seq2seq model with soft attention? 

2. Does it make sense to use two separate target networks for the double DQN
agent? Why or why not? 

3. What kind of deep neural networks would we use for the Q-learning model? 
Why would or would not CNNs be appropriate? 


Future Outlook


Predicting the future of AI is no more possible today than it has been in years past. 
Furthermore, the farther into the future we project, the greater the uncertainty. In 
general, some things may go exactly as expected (improvements in computational 
speed), some expectations may have slight variability (the dominant deep learning 
architectures), and others are maverick innovations that are unlikely to be predicted 
(the intersection of big data, computational speed, and emergence of deep learning 
all at the same time). At the conclusion of this book, we would like to provide our 
predictions based on the current trajectories, trends, and usefulness of the research 
we’ve discussed. We reject all claims to be considered soothsayers or even reliable 
parties in these projections. We attempt to only provide considerations for the reader 
at the conclusion of these topics and suggest areas of awareness over upcoming 
years. 


End-to-End Architecture Prevalence 


Given the success of many end-to-end approaches in both NLP and speech, we 
expect that more systems will move towards these architectures. One area where 
these approaches lack robustness is in tuning to particular environments, a role 
played, for example, by the lexicon model in the hybrid ASR architecture or by 
the adaptation of language models to new domains. This is an area that must be 
addressed for deep learning to make a significant impact in domains where training 
data is costly or unavailable. 


Transition to AI-Centric 


One of the simplest projections is that more companies will shift to, or center 
around, an AI-centric strategy. Many of the leading tech companies (for example, 
Google, Facebook, and Twitter) have moved in this direction, and this trend will 
likely continue into many other large and mid-sized companies. This shift will introduce 
machine learning into every level of software development and with it the need for 
tools and processes to ensure reliability and generality. Some have coined the term 
“Software 2.0”¹ in light of this shift. Transitioning to this state will require increased 
rigor around data, interpretability of models, an increased focus on model security, 
and resiliency to adversarial scenarios. 


Specialized Hardware 


Specialized hardware will become more common. This pattern of development 
is already familiar from the use of ASIC (application-specific integrated circuit) 
hardware for cryptocurrency mining and from the image processors embedded in 
smartphones. The introduction of TPUs has been one of the first cases where dedicated 
physical hardware has been created specifically for deep learning. The introduction 
of the Apple A11 chip is another example of specialized hardware to support neural 
networks on mobile devices. 


Transition Away from Supervised Learning 


We expect the focus of machine learning to shift. Deep learning has seen the largest 
improvements with supervised data; however, the cost of creating 
large, labeled datasets is often prohibitive. In many scenarios, large 
unlabeled sources exist that can be used by unsupervised algorithms, and we expect 
research to concentrate more heavily in this area, as seen in the progression of word 
embeddings and language models. 


Explainable AI 


Though end-to-end deep learning techniques are powerful and can result in impressive 
performance on metrics such as accuracy, they suffer from poor interpretability. Many 
applications in the financial world (such as loan applications or conduct surveillance) 
or in healthcare (such as predicting disease) need models and predictions to be 
explainable. There has been a shift in the industry towards explainable AI (XAI). 
Techniques such as Local Interpretable Model-agnostic Explanations (LIME), 
Deep Learning Important FeaTures (DeepLIFT), and SHapley Additive exPlanations 
(SHAP), to name a few, have been very promising in providing model-agnostic 
explanations for individual predictions as well as summarization of models. Innovations 
such as these and others will be necessary to overcome the hurdles of interpretability 
of models and trust of AI. 

¹ https://medium.com/@karpathy/software-2-0-a64152b37c35. 
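
To make the idea of explaining an individual prediction concrete, the sketch below removes one token at a time and records how much the predicted probability changes. This is a simple occlusion-style attribution that we use purely for illustration; it is not LIME, DeepLIFT, or SHAP. The toy classifier, its WEIGHTS table, and the function names are all hypothetical.

```python
import math

# Hypothetical word weights standing in for a trained sentiment classifier.
WEIGHTS = {"great": 2.0, "terrible": -2.0}


def predict_positive(tokens):
    """Toy model: probability that the token sequence is positive."""
    score = sum(WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_attributions(tokens, predict):
    """Score each token by the drop in probability when it is removed."""
    base = predict(tokens)
    return [
        (tok, base - predict(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]


# Each token is unique in this sentence, so a dict is safe here.
attributions = dict(
    occlusion_attributions(["the", "movie", "was", "great"], predict_positive)
)
```

In this toy setting, only "great" receives a non-zero attribution, because it is the only token the model weights. Real explanation methods use more careful perturbation or propagation schemes, but the output has a similar shape: a per-feature contribution to a single prediction.
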


Model Development and Deployment Process 


There is a trade-off in deep learning between the ease of experimentation during 
model development and the deployment of these models in a high-performing, 
low-latency production environment with highly optimized code. This trade-off is more 
pronounced for NLP and speech recognition models, as they often rely on complex 
dynamic graphs, whereas static graphs are preferred for optimized performance at 
runtime. Frameworks such as PyText, which help to tune pre-built models, perform 
experiments rapidly, provide pre-built workflows for model designers and engineers, 
and support easy deployment of models to production environments with minimal 
intervention, will soon become a necessary part of the development process. Model 
testing and quality assurance is another aspect of the development and deployment 
process that needs to be adjusted to accommodate complex deep learning models. 
Google’s research paper “The ML Test Score: A Rubric for ML Production 
Readiness and Technical Debt Reduction” proposes a useful framework for testing 
these complex deep-learning-based systems. 


Democratization of AI 


AI and deep learning are used by a still very small but rapidly growing group of 
researchers, educators, experts, and practitioners. To make them accessible to the 
masses through applications, tools, or education, there needs to be a change in at- 
titude, policies, investment, and research, especially from top companies and uni- 
versities. This phenomenon is called the “democratization of AI.” Many companies 
such as Google, Microsoft, and Facebook, as well as many universities such as MIT, 
Stanford, and Oxford are contributing to software tools, libraries, datasets, courses, 
etc. that are freely available on the web. The positive trend in this direction will play 
a huge role in transforming lives through AI. 


NLP Trends 


Language models can be pre-trained on a large corpus of unlabeled data, giving them a 
considerable advantage. Language models are now considered to add enormous benefits 
for many NLP tasks. Language model embeddings provide features for complex 
tasks and have been shown to improve on state-of-the-art methods across many tasks. 
Using adversarial methods to understand models, analyze 
failure cases, or improve the robustness of models is becoming a trend in deep learning 
research. Moving towards under-resourced languages and using deep learning techniques 
such as transfer learning is another area on which many researchers are focusing, 
especially in tasks such as machine translation. 

One of the most intriguing areas of development is reinforcement 
learning. Instead of collecting data, training a model, putting it into production, and 
testing the result, an agent could be created to interact with the environment (real or 
synthetic) and learn based on its experience. Overall, we see the progression moving 
from supervised to unsupervised to reinforcement techniques. 


Speech Trends 


Many of the end-to-end deep learning techniques are able to outperform traditional 
hybrid HMM-based models with less tuning and linguistic expertise. These models 
perform very well in scenarios where training data is widely available, typically in 
general speech recognition tasks. However, they tend to struggle when context is 
crucial to prediction. Additionally, the effort to fuse speech and NLP 
is likely to continue, with end-to-end learning taking the lead. Recent 
advancements are focusing on incorporating domain information into the decoding 
procedure via language model fusion for contextualized recognition. 
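
One simple form of language model fusion is shallow fusion, in which the recognizer's log-probability for a hypothesis is combined with a weighted log-probability from an external, in-domain language model. The sketch below applies this to a small n-best list rather than inside beam search, which is a simplification on our part. The unigram table DOMAIN_UNIGRAM, the function names, the scores, and the weight of 0.3 are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical in-domain unigram probabilities standing in for a real language model.
DOMAIN_UNIGRAM = {
    "recognize": 0.2, "speech": 0.3,
    "wreck": 0.01, "a": 0.2, "nice": 0.1, "beach": 0.05,
}


def toy_lm_logprob(text, floor=1e-4):
    """Log-probability of a hypothesis under the toy unigram model."""
    return sum(math.log(DOMAIN_UNIGRAM.get(word, floor)) for word in text.split())


def shallow_fusion_rescore(hypotheses, lm_logprob, lm_weight=0.3):
    """Re-rank (text, asr_logprob) pairs by asr_logprob + lm_weight * lm_logprob(text)."""
    fused = [
        (text, asr_lp + lm_weight * lm_logprob(text))
        for text, asr_lp in hypotheses
    ]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


# The acoustic scores (assumed) slightly prefer the wrong transcript.
hyps = [("wreck a nice beach", -4.0), ("recognize speech", -4.5)]
rescored = shallow_fusion_rescore(hyps, toy_lm_logprob)
best_text = rescored[0][0]
```

With the language model weight set to zero, the ranking falls back to the recognizer's own scores. Note that an unnormalized language model score of this kind also favors shorter hypotheses, so practical systems often add a length or coverage term.
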

Speech recognition also still struggles with varied acoustic environments 
and speaker-specific differences such as accents. Leveraging generated data 
from text-to-speech systems is gaining traction, providing simulated environments 
and speakers for improved robustness. We expect the incorporation of text-to-speech 
systems, similar to GAN workflows, to continue to improve, and they will potentially be 
incorporated more fully into reinforcement workflows. 


Closing Remarks 


We hope that the readers have found this book both informative 
and helpful. Deep learning has heavily impacted NLP and speech in the past few 
years, and the trend seems to be gaining speed. We have hopefully enabled the 
readers to understand both fundamental and advanced techniques that deep learning 
offers, while also showing how to practically apply them. 